This analysis examines transparency and accountability challenges in public procurement systems using procurement data. The tool provides a framework for assessing the extent to which regulatory requirements are reflected in actual implementation, focusing on transparency and accountability as key determinants of procurement integrity and corruption risk. Beyond evaluating data availability and accessibility, the analysis proposes tests to identify potential collusive relationships between buyers and suppliers and assesses how transparency affects competition and price overruns.

Is data open, machine readable and not missing any crucial fields?

Platform information

Requires manual filling

Field Value
Platform URL
Year of introduction
Type of access Free / paid / limited

Data fields

Requires manual filling

Field Is available? Type of data
Procurement plans Yes/No Scanned PDFs / xml / html / no information
Call for tenders Yes/No Scanned PDFs / xml / html / no information
Modification or cancellation Yes/No Scanned PDFs / xml / html / no information
Announcement of awards Yes/No Scanned PDFs / xml / html / no information
Details on contract signature Yes/No Scanned PDFs / xml / html / no information
Implementation information Yes/No Scanned PDFs / xml / html / no information
Framework agreement Yes/No Scanned PDFs / xml / html / no information
Direct awards Yes/No Scanned PDFs / xml / html / no information
Subcontractors Yes/No Scanned PDFs / xml / html / no information

Source: National platform website

Country-specific results

Bulgaria (BG)

Basic data summary for BG

Contracts per year:

Tender year Number of contracts
2007 7715
2008 11552
2009 8595
2010 10424
2011 14393
2012 18833
2013 20472
2014 20723
2015 20698
2016 14857
2017 9480
2018 10549
2019 9500
2020 614
NA 6912

Number of unique buyers (buyer_masterid): 30,160

Number of unique bidders (bidder_masterid): 106,703

Number of unique tenders per year:

Tender year Number of unique tenders
2007 4604
2008 7037
2009 3704
2010 3174
2011 4038
2012 7175
2013 9409
2014 9384
2015 8869
2016 7666
2017 6827
2018 7470
2019 6097
2020 365
NA 6912

Variables present (excluding indicator variables):

  • tender_id
  • lot_number
  • bid_number
  • bid_iswinning
  • tender_country
  • tender_awarddecisiondate
  • tender_contractsignaturedate
  • tender_biddeadline
  • tender_proceduretype
  • tender_nationalproceduretype
  • tender_supplytype
  • tender_publications_firstcallfortenderdate
  • notice_url
  • source
  • tender_publications_firstdcontractawarddate
  • tender_publications_lastcontractawardurl
  • buyer_masterid
  • buyer_id
  • buyer_city
  • buyer_postcode
  • buyer_country
  • buyer_nuts
  • buyer_name
  • buyer_buyertype
  • buyer_mainactivities
  • tender_addressofimplementation_country
  • tender_addressofimplementation_nuts
  • bidder_masterid
  • bidder_id
  • bidder_country
  • bidder_nuts
  • bidder_name
  • bid_price
  • bid_priceusd
  • bid_pricecurrency
  • lot_productcode
  • lot_localproductcode_type
  • lot_localproductcode
  • submp
  • decp
  • is_capital
  • lot_title
  • lot_estimatedpriceusd
  • lot_estimatedprice
  • lot_estimatedpricecurrency
  • tender_year
  • has_buyer_id
  • has_bidder_id
  • has_price
  • buyer_group

Are there observable patterns of underreporting information in the data?

The first step is to assess data completeness by examining missing values across all variables or a defined subset. Given the limited number of variables in the ProAct dataset, a full review can be efficiently conducted to identify any variables with significant underreporting. Each variable contributed to one of the ProAct indicators.

Overall Missing Values by Variable – BG

This figure summarises the share of missing values for each key variable in the dataset. For every non-indicator column (all variables that do not start with ‘ind_’), the pipeline computes the mean of an NA indicator across all observations, i.e. the proportion of records where that field is missing. Variables are displayed with human-readable labels (e.g. buyer and bidder IDs, names, locations, prices). Use this plot to identify which core fields are most affected by missingness and may compromise analysis – especially high-missingness identifiers, dates, and price variables.

Missingness by Buyer Type – BG

This heatmap shows how missingness varies across buyer types and variables. Buyers are categorised into ‘National’, ‘Regional’, ‘Utilities’, ‘EU agency’, and ‘Other Public Bodies’ based on the ‘buyer_buyertype’ field. For each buyer group, the pipeline computes the share of missing values for all non-indicator variables and reshapes them into a long format. Darker cells indicate a higher proportion of missing values for a given variable within a buyer group. Pay particular attention to patterns where specific buyer types systematically lack key identifiers (e.g. buyer IDs, names, addresses) or financial fields, as these gaps may signal structural reporting issues.

Missingness Among Top Buyers – BG

This bar chart highlights the top buyers with the highest overall share of missing data. The underlying calculation groups the data by ‘buyer_masterid’ and ‘buyer_buyertype’, keeps only buyers with at least 100 records, and then computes the average share of missing values across all non-indicator variables. Labels show truncated buyer names with buyer type on the next line. Use this figure to spot specific organisations that drive data quality problems: buyers with very high missingness may require targeted engagement or manual cleaning before deeper analysis.

Missingness by Procedure Type – BG

This heatmap displays the share of missing values for each variable, broken down by procurement procedure type. The pipeline recodes ‘tender_proceduretype’ into readable procedural categories (e.g. Open, Restricted, Negotiated with/without publication, Outright award, Competitive dialogue), then computes missingness shares for all non-indicator variables by procedure group and reshapes to long format. Each cell gives the proportion of missing values for a given variable within a procedure type. Look for procedure types where key information (such as buyer and bidder identifiers, contract values, or dates) is systematically missing – this may indicate opacity in less competitive or exceptional procedures.

Missingness by Region (NUTS3) – BG

This choropleth map shows the overall share of missing values by NUTS3 region, where such information is available. The pipeline first cleans ‘buyer_nuts’ codes into valid NUTS3 identifiers, then computes the average share of missing values across all non-indicator variables for each region. These regional missingness rates are then joined to Eurostat NUTS3 geometries for the selected country (year 2021) and plotted. Regions shaded in darker colours have a higher share of missing information in buyer-level or contract-level fields. Pay attention to regions where missingness is systematically high, as this may reflect regional reporting capacity issues or gaps in how certain subnational entities interact with the procurement system.

Can this data be matched to other registers in the country to ensure higher quality monitoring?

The ability to match public procurement data with other registers significantly enhances the analytical power of research and strengthens auditing capabilities. Most importantly, access to company register or beneficial ownership data provides valuable insights into the supplier side—such as how long the firm has been operating, its profitability, the ratio of contract value to revenue, and how closely the procurement sector aligns with the company’s main area of activity.

In addition, linking buyers to relevant information about the government agency can greatly improve analysis on the buyer side, for instance by incorporating data on politically exposed persons or asset declarations. Finally, geospatial information can offer further insights into the location of the company or the procuring agency, enriching both analytical and oversight capacities.

Other types of national data available for matching

Requires manual filling

Data type Availability (access + format) ID type Comments
Company register Open / restricted / paid + file format Tax ID / other national ID / BvD ID Can be used as external validation source
Beneficial ownership data Open / restricted / paid + file format Tax ID / other national ID / BvD ID Can share data per requested list of companies
Asset and interest declarations Open / restricted / paid + file format Name of politician / Personal Tax ID / other Often requires manual collection or scraping

Case-specific analysis – matching rate between data sources

Can be filled when respective data is available

Dataset A Dataset B Identifier Matching rate
Procurement Company register Supplier ID + year 0–100%
Company register PEP data Name 0–100%

Source: Contract-level data; company register records; other micro data

Interoperability potential at organization level

Organization type ID type Missing share
Supplier Source ID 59%
Supplier Generated ID 0%
Supplier Name 0%
Supplier Address (postcode) 65%
Buyer Source ID 61%
Buyer Generated ID 0%
Buyer Name 0%
Buyer Address (postcode) 65%

Do companies winning contracts have risky profiles?

This analysis seeks to identify potentially suspicious patterns in company behavior, including movements between markets or registration in tax haven jurisdictions. With access to company and beneficial ownership register data, additional and more sophisticated indicators could be developed.

Network of Unusual Supplier Market Entries – BG

This network visualisation maps atypical movements of suppliers across CPV product clusters. Each node represents a 3-digit CPV cluster, i.e. a broad product or service market derived from ‘lot_productcode’. For every bidder, a ‘home’ CPV cluster is defined as the cluster where the bidder wins the most awards. Entries into other clusters are treated as ‘target’ markets. A bidder–cluster combination is flagged as atypical if the bidder has enough overall history, but that cluster accounts for less than 5% of their awards and at most three wins there. For each bidder–cluster pair, the pipeline computes a Laplace-smoothed probability of observing that bidder in that cluster, p(i|c) = (n_ic + 1) / (n_c + n_suppliers_c), where n_ic is the number of contracts for bidder i in cluster c, n_c is the total number of contracts in c, and n_suppliers_c is the number of distinct suppliers in c. The surprise score is then defined as -log(p(i|c)) and standardised to a z-score; higher values indicate that the bidder is unusually rare in that cluster given its overall structure. These atypical entries are aggregated at the level of home CPV cluster (source) and target CPV cluster (destination). Directed edges are drawn only for pairs of clusters where at least four distinct bidders make such atypical entries. Edge width reflects the number of bidders involved (n_bidders), and edge colour encodes the average standardised surprise score: thicker and redder edges mark flows where many suppliers enter a target market in a way that is collectively more unusual than average. Node positions are based on a force-directed layout (Fruchterman–Reingold) and help visually cluster markets that are strongly connected by unusual flows. When interpreting the figure, focus on CPV clusters with multiple thick, high-surprise edges—either as sources or as targets—as these markets may concentrate non-standard or risky supplier behaviour that merits further qualitative investigation.

Suppliers with Unusually Diversified Market Entries – BG

Each bar represents a supplier flagged as behaving unusually across CPV clusters. For every bidder, the pipeline looks at all atypical CPV entries (those with high surprise z-scores) and records: the maximum surprise z-score attained across these entries, and the number of distinct target markets entered atypically. Bar height shows the maximum surprise z-score, while bar colour (if used) or additional labelling can indicate how many distinct markets each supplier enters in an unusual way. Use this figure to identify suppliers whose market expansion patterns are particularly non-standard and may warrant closer review.

Markets Attracting Unusual Supplier Entries – BG

Each point represents a 3-digit CPV cluster (a broad product or service market). For each target cluster, the pipeline aggregates all atypical entries into that market and computes: (i) the number of distinct suppliers entering atypically (x-axis), (ii) the mean standardised surprise score among those entries (y-axis), and (iii) the number of atypical awards (bubble size). The resulting scatter/bubble plot highlights markets where many suppliers behave unusually (high x), where entries are particularly surprising on average (high y), or both. Clusters in the upper-right with large bubbles are markets that attract many highly atypical entries and may be especially relevant for risk-based follow-up.

Are there suspicious connections between buyers and suppliers?

This analysis examines the potential existence of suspicious connections between buyers and suppliers. It can greatly benefit from access to additional registers—such as politically exposed persons (PEP) lists, interest or asset declaration databases, and beneficial ownership (BO) registers—as these can reveal direct links between suppliers and buyers.

Top Buyers by Supplier Concentration Over Time – BG

This faceted bar chart shows, for each year, the buyers with the highest concentration of spending on a single supplier. The pipeline groups contracts by ‘tender_year’, ‘buyer_masterid’, and ‘bidder_masterid’, computes total spend per buyer–supplier pair, and then derives, for each buyer-year, the share of spending allocated to each supplier (buyer_concentration). For buyers with at least three suppliers, the maximum of this share is used as the concentration indicator, and the top 30 buyer–year combinations per year are retained for plotting. Bars represent the maximum concentration for each buyer in a given year, labels show the number of contracts, and buyers that appear in the top list in more than one year are highlighted in red. High and persistent concentration (especially repeated red buyers across years) may signal stable and potentially problematic dependencies on single suppliers that merit closer investigation.

Is there an observable effect of transparency and integrity issues on prices and competition?

While this analysis does not establish any causal relationships, it can be used to explore correlations between transparency-related measures, prices, and competitiveness.

Predicted Single-Bidding by Missing Share – BG

This figure summarises the estimated relationship between data missingness and the prevalence of single-bid tenders. At the buyer–year level (for years after 2014), the pipeline computes: (i) a cumulative missing-share indicator as the average of all variable-specific missing shares, and (ii) the average share of single-bid tenders using the ‘ind_corr_singleb’ indicator (rescaled to [0,1]). A fractional logit model with year fixed effects regresses the buyer-year single-bidding share on cumulative missingness, controlling for the number of contracts and average contract value, and clustering standard errors by buyer type. The line shows the predicted single-bidding share as missingness increases; the shaded band is the 95% confidence interval. Interpret an upward slope as evidence that poorer transparency and data completeness are associated with less competition, while paying attention to the uncertainty band and the range of missingness actually observed in the data.

Figure not produced: no information or key variable is missing for the plot.

Sensitivity Analysis: Single-Bidding Model – BG

This table summarizes the robustness of the single-bidding analysis across different model specifications. The pipeline runs multiple versions of the fractional logit model with varying combinations of: fixed effects (year FE, buyer type FE, or both), clustering (by buyer, buyer type, or none), and control variables (contract count, average value, or both). Each row shows results for one specification, including the coefficient estimate, standard error, p-value, and significance stars. Use this table to verify that the core finding about missingness and single-bidding holds across reasonable modeling choices and is not driven by a single specification.

What is sensitivity analysis?

Sensitivity analysis tests whether the relationship between missing data and the outcome (single-bidding or relative prices) holds across different modeling choices. A robust finding should be consistent across reasonable variations in:

  • Fixed effects (FE): Controls for unobserved differences across buyers, years, etc.
  • Clustering: Adjusts standard errors for correlation within groups
  • Control variables: Additional factors that might affect the outcome

Overall Summary:

n_specs share_positive share_negative median_estimate median_pvalue median_strength share_p_le_0.05 share_p_le_0.1 share_p_le_0.2 share_sign_stable n_nonzero
45 0 1 -11.246 0 -0.662 1 1 1 1 45

How to interpret:

  • n_specs (45): Number of alternative model specifications tested

  • share_positive (0.0%): Share of models where missingness is associated with a higher outcome

  • share_negative (100.0%): Share of models where missingness is associated with a lower outcome

  • median_estimate (-11.246): Typical effect size across models (negative implies a negative association)

  • median_strength (-0.662): Median standardized strength (closer to 0 = weaker)

  • median_pvalue (0.000): Median p-value across all specifications

  • share_p_le_0.05 (100.0%): Share of models significant at p ≤ 0.05

  • share_p_le_0.10 (100.0%): Share of models significant at p ≤ 0.10

  • share_p_le_0.20 (100.0%): Share of models significant at p ≤ 0.20 (lenient threshold)

  • ⚠ Moderate evidence: Direction is mostly consistent, but statistical significance varies across specifications

Show detailed breakdown by model components

By Fixed Effects (FE):

Fixed effects control for unobserved heterogeneity. For example: - Buyer FE: Controls for buyer-specific characteristics (e.g., capacity, procurement culture) - Year FE: Controls for time trends (e.g., economic conditions, regulatory changes) - Buyer + Year FE: Controls for both simultaneously

fe n_specs share_positive share_p10 median_p median_strength
buyer 15 0 1 0 -0.662
buyer+year 15 0 1 0 -0.780
year 15 0 1 0 -0.286

Look for: Which fixed effect structure gives the most significant results? If results hold across all FE specifications, the finding is very robust.

By Clustering:

Clustering adjusts standard errors for correlation within groups: - Buyer clustering: Accounts for correlation within the same buyer over time - Year clustering: Accounts for common shocks affecting all buyers in a year - None: Assumes all observations are independent (most restrictive)

cluster n_specs share_positive share_p10 median_p median_strength
none 9 0 1 0 -0.662
buyer 9 0 1 0 -0.662
year 9 0 1 0 -0.662
buyer_year 9 0 1 0 -0.662
buyer_buyertype 9 0 1 0 -0.662

Look for: If results remain significant even with buyer clustering (most conservative), the evidence is strong.

By Controls:

Control variables are additional factors that might affect the outcome: - x_only: No controls (only missingness and fixed effects) - base: Core controls (contract value, buyer type, procedure type) - base_extra: Extended controls (additional economic/organizational factors)

controls n_specs share_positive share_p10 median_p median_strength
base 15 0 1 0 -0.662
base_extra 15 0 1 0 -0.662
x_only 15 0 1 0 -0.662

Look for: Does the relationship hold even with rich controls? If yes, it’s less likely to be driven by omitted variables.

Classification of Model Results:

This table shows how many models fall into each category:

class n share
Negative & significant 45 1

Ideal: Most models should be ‘Positive & significant’. If many are ‘Negative & significant’ or evenly split, the evidence is inconclusive.

Top Model Specifications (by success rate):

These are the combinations of fixed effects, clustering, and controls that most consistently produce significant results:

fe cluster controls n share_pok median_p median_est
year none base_extra 1 1 0 -6.312
year none base 1 1 0 -4.547
year none x_only 1 1 0 -4.547
buyer+year none base_extra 1 1 0 -14.631
buyer+year none base 1 1 0 -14.631
buyer+year none x_only 1 1 0 -14.631
buyer none base 1 1 0 -11.246
buyer none x_only 1 1 0 -11.246
buyer none base_extra 1 1 0 -11.246
buyer+year buyer base_extra 1 1 0 -14.631

Look for: Combinations with high share_pok (share of p-values ≤ 0.10). These are the most reliable specification choices for this dataset.


Summary for Decision-Makers:

Moderate evidence. The relationship is mostly in one direction, but statistical significance varies across specifications.

Key statistics: - 0% of models are positive - 100% of models are negative - 100% of models are statistically significant (p ≤ 0.10) - Median effect size: -11.246 - Median p-value: 0.000

Predicted Relative Price by Missing Share – BG

This figure shows how relative prices correlate with overall missingness after controlling for key factors. For contracts with valid price information (bid price and estimated lot value), the pipeline computes ‘relative_price’ as bid_price / lot_estimatedprice and trims extreme or non-positive values. It then constructs ‘total_missing_share’ as the average of NA indicators across a predefined set of core variables, and estimates a linear fixed-effects model of relative_price on total_missing_share, controlling for procedure type, buyer type, and year (with year fixed effects and clustering by buyer). The line displays predicted relative prices as missingness increases, with a shaded 95% confidence interval. An upward relationship suggests that higher missingness – a proxy for weaker transparency and data quality – may be associated with higher prices relative to estimates; a flat line would indicate no clear price effect.

Sensitivity Analysis: Relative Price Model – BG

This table summarizes the robustness of the relative price analysis across different model specifications. The pipeline runs multiple versions of the linear fixed-effects model with varying combinations of: fixed effects (year FE, procedure type FE, buyer type FE, or combinations), clustering (by buyer, buyer type, or none), and control variables. Each row shows results for one specification, including the coefficient estimate, standard error, p-value, and significance stars. Use this table to assess whether the relationship between missingness and relative prices is consistent across different modeling assumptions or sensitive to specific controls and fixed effects.

What is sensitivity analysis?

Sensitivity analysis tests whether the relationship between missing data and the outcome (single-bidding or relative prices) holds across different modeling choices. A robust finding should be consistent across reasonable variations in:

  • Fixed effects (FE): Controls for unobserved differences across buyers, years, etc.
  • Clustering: Adjusts standard errors for correlation within groups
  • Control variables: Additional factors that might affect the outcome

Overall Summary:

n_specs share_positive share_negative median_estimate median_pvalue median_strength share_p_le_0.05 share_p_le_0.1 share_p_le_0.2 share_sign_stable n_nonzero
144 0.278 0.722 -0.255 0.211 -0.03 0.34 0.417 0.493 0 144

How to interpret:

  • n_specs (144): Number of alternative model specifications tested

  • share_positive (27.8%): Share of models where missingness is associated with a higher outcome

  • share_negative (72.2%): Share of models where missingness is associated with a lower outcome

  • median_estimate (-0.255): Typical effect size across models (negative implies a negative association)

  • median_strength (-0.030): Median standardized strength (closer to 0 = weaker)

  • median_pvalue (0.211): Median p-value across all specifications

  • share_p_le_0.05 (34.0%): Share of models significant at p ≤ 0.05

  • share_p_le_0.10 (41.7%): Share of models significant at p ≤ 0.10

  • share_p_le_0.20 (49.3%): Share of models significant at p ≤ 0.20 (lenient threshold)

  • ⚠ Moderate evidence: Direction is mostly consistent, but statistical significance varies across specifications

Show detailed breakdown by model components

By Fixed Effects (FE):

Fixed effects control for unobserved heterogeneity. For example: - Buyer FE: Controls for buyer-specific characteristics (e.g., capacity, procurement culture) - Year FE: Controls for time trends (e.g., economic conditions, regulatory changes) - Buyer + Year FE: Controls for both simultaneously

fe n_specs share_positive share_p10 median_p median_strength
buyer#year 30 0.000 0.733 0.049 -0.083
year 30 0.500 0.533 0.017 -0.002
0 24 0.500 0.458 0.137 -0.019
buyer 30 0.233 0.333 0.646 -0.017
buyer+year 30 0.200 0.033 0.772 -0.005

Look for: Which fixed effect structure gives the most significant results? If results hold across all FE specifications, the finding is very robust.

By Clustering:

Clustering adjusts standard errors for correlation within groups: - Buyer clustering: Accounts for correlation within the same buyer over time - Year clustering: Accounts for common shocks affecting all buyers in a year - None: Assumes all observations are independent (most restrictive)

cluster n_specs share_positive share_p10 median_p median_strength
none 24 0.208 0.667 0.004 -0.034
buyer 30 0.267 0.467 0.155 -0.034
year 30 0.267 0.400 0.254 -0.034
buyer_year 30 0.267 0.267 0.255 -0.034
buyer_buyertype 30 0.367 0.333 0.410 -0.014

Look for: If results remain significant even with buyer clustering (most conservative), the evidence is strong.

By Controls:

Control variables are additional factors that might affect the outcome: - x_only: No controls (only missingness and fixed effects) - base: Core controls (contract value, buyer type, procedure type) - base_extra: Extended controls (additional economic/organizational factors)

controls n_specs share_positive share_p10 median_p median_strength
x_only 72 0.042 0.611 0.038 -0.050
base 72 0.514 0.222 0.541 0.004

Look for: Does the relationship hold even with rich controls? If yes, it’s less likely to be driven by omitted variables.

Classification of Model Results:

This table shows how many models fall into each category:

class n share
Negative & significant 57 0.396
Negative but insignificant 47 0.326
Positive & significant 3 0.021
Positive but insignificant 37 0.257

Ideal: Most models should be ‘Positive & significant’. If many are ‘Negative & significant’ or evenly split, the evidence is inconclusive.

Top Model Specifications (by success rate):

These are the combinations of fixed effects, clustering, and controls that most consistently produce significant results:

fe cluster controls n share_pok median_p median_est
year none x_only 3 1 0.000 -0.813
0 buyer x_only 3 1 0.000 -0.907
year buyer x_only 3 1 0.000 -0.813
buyer none x_only 3 1 0.000 -0.423
0 buyer_buyertype x_only 3 1 0.000 -1.325
year buyer_buyertype x_only 3 1 0.000 -1.443
buyer#year none base 3 1 0.001 -0.917
buyer#year buyer_buyertype base 3 1 0.001 -0.917
buyer#year none x_only 3 1 0.002 -0.427
0 year x_only 3 1 0.004 -0.907

Look for: Combinations with high share_pok (share of p-values ≤ 0.10). These are the most reliable specification choices for this dataset.


Summary for Decision-Makers:

Moderate evidence. The relationship is mostly in one direction, but statistical significance varies across specifications.

Key statistics: - 28% of models are positive - 72% of models are negative - 42% of models are statistically significant (p ≤ 0.10) - Median effect size: -0.255 - Median p-value: 0.211

Uruguay (UY)

Basic data summary for UY

Contracts per year:

Tender year Number of contracts
2014 88
2015 57262
2016 71525
2017 85379
2018 46791

Number of unique buyers (buyer_masterid): 268

Number of unique bidders (bidder_masterid): 11,822

Number of unique tenders per year:

Tender year Number of unique tenders
2014 45
2015 47122
2016 58670
2017 65622
2018 37364

Variables present (excluding indicator variables):

  • tender_id
  • lot_number
  • bid_number
  • tender_country
  • tender_awarddecisiondate
  • tender_biddeadline
  • tender_proceduretype
  • tender_nationalproceduretype
  • tender_supplytype
  • tender_publications_firstcallfortenderdate
  • source
  • buyer_masterid
  • buyer_id
  • buyer_name
  • buyer_buyertype
  • bidder_masterid
  • bidder_id
  • bidder_country
  • bidder_name
  • bid_price
  • bid_priceusd
  • bid_pricecurrency
  • lot_productcode
  • lot_localproductcode_type
  • lot_localproductcode
  • submp
  • decp
  • is_capital
  • lot_title
  • tender_year
  • has_buyer_id
  • has_bidder_id
  • has_price
  • buyer_group

Are there observable patterns of underreporting information in the data?

The first step is to assess data completeness by examining missing values across all variables or a defined subset. Given the limited number of variables in the ProAct dataset, a full review can be efficiently conducted to identify any variables with significant underreporting. Each variable contributed to one of the ProAct indicators.

Overall Missing Values by Variable – UY

This figure summarises the share of missing values for each key variable in the dataset. For every non-indicator column (all variables that do not start with ‘ind_’), the pipeline computes the mean of an NA indicator across all observations, i.e. the proportion of records where that field is missing. Variables are displayed with human-readable labels (e.g. buyer and bidder IDs, names, locations, prices). Use this plot to identify which core fields are most affected by missingness and may compromise analysis – especially high-missingness identifiers, dates, and price variables.

Missingness by Buyer Type – UY

This heatmap shows how missingness varies across buyer types and variables. Buyers are categorised into ‘National’, ‘Regional’, ‘Utilities’, ‘EU agency’, and ‘Other Public Bodies’ based on the ‘buyer_buyertype’ field. For each buyer group, the pipeline computes the share of missing values for all non-indicator variables and reshapes them into a long format. Darker cells indicate a higher proportion of missing values for a given variable within a buyer group. Pay particular attention to patterns where specific buyer types systematically lack key identifiers (e.g. buyer IDs, names, addresses) or financial fields, as these gaps may signal structural reporting issues.

Missingness Among Top Buyers – UY

This bar chart highlights the top buyers with the highest overall share of missing data. The underlying calculation groups the data by ‘buyer_masterid’ and ‘buyer_buyertype’, keeps only buyers with at least 100 records, and then computes the average share of missing values across all non-indicator variables. Labels show truncated buyer names with buyer type on the next line. Use this figure to spot specific organisations that drive data quality problems: buyers with very high missingness may require targeted engagement or manual cleaning before deeper analysis.

Missingness by Procedure Type – UY

This heatmap displays the share of missing values for each variable, broken down by procurement procedure type. The pipeline recodes ‘tender_proceduretype’ into readable procedural categories (e.g. Open, Restricted, Negotiated with/without publication, Outright award, Competitive dialogue), then computes missingness shares for all non-indicator variables by procedure group and reshapes to long format. Each cell gives the proportion of missing values for a given variable within a procedure type. Look for procedure types where key information (such as buyer and bidder identifiers, contract values, or dates) is systematically missing – this may indicate opacity in less competitive or exceptional procedures.

Missingness by Region (NUTS3) – UY

This choropleth map shows the overall share of missing values by NUTS3 region, where such information is available. The pipeline first cleans ‘buyer_nuts’ codes into valid NUTS3 identifiers, then computes the average share of missing values across all non-indicator variables for each region. These regional missingness rates are then joined to Eurostat NUTS3 geometries for the selected country (year 2021) and plotted. Regions shaded in darker colours have a higher share of missing information in buyer-level or contract-level fields. Pay attention to regions where missingness is systematically high, as this may reflect regional reporting capacity issues or gaps in how certain subnational entities interact with the procurement system.

Figure not produced: no information or key variable is missing for the plot.

Can this data be matched to other registers in the country to ensure higher quality monitoring?

The ability to match public procurement data with other registers significantly enhances the analytical power of research and strengthens auditing capabilities. Most importantly, access to company register or beneficial ownership data provides valuable insights into the supplier side—such as how long the firm has been operating, its profitability, the ratio of contract value to revenue, and how closely the procurement sector aligns with the company’s main area of activity.

In addition, linking buyers to relevant information about the government agency can greatly improve analysis on the buyer side, for instance by incorporating data on politically exposed persons or asset declarations. Finally, geospatial information can offer further insights into the location of the company or the procuring agency, enriching both analytical and oversight capacities.

Other types of national data available for matching

Requires manual filling

Data type Availability (access + format) ID type Comments
Company register Open / restricted / paid + file format Tax ID / other national ID / BvD ID Can be used as external validation source
Beneficial ownership data Open / restricted / paid + file format Tax ID / other national ID / BvD ID Can share data per requested list of companies
Asset and interest declarations Open / restricted / paid + file format Name of politician / Personal Tax ID / other Often requires manual collection or scraping

Case-specific analysis – matching rate between data sources

Can be filled when respective data is available

Dataset A Dataset B Identifier Matching rate
Procurement Company register Supplier ID + year 0–100%
Company register PEP data Name 0–100%

Source: Contract-level data; company register records; other micro data

Interoperability potential at organization level

Organization type ID type Missing share
Supplier Source ID 0%
Supplier Generated ID 0%
Supplier Name 0%
Supplier Address (postcode) Not available (variable missing in dataset)
Buyer Source ID 0%
Buyer Generated ID 0%
Buyer Name 0%
Buyer Address (postcode) Not available (variable missing in dataset)

Do companies winning contracts have risky profiles?

This analysis seeks to identify potentially suspicious patterns in company behavior, including movements between markets or registration in tax haven jurisdictions. With access to company and beneficial ownership register data, additional and more sophisticated indicators could be developed.

Network of Unusual Supplier Market Entries – UY

This network visualisation maps atypical movements of suppliers across CPV product clusters. Each node represents a 3-digit CPV cluster, i.e. a broad product or service market derived from ‘lot_productcode’. For every bidder, a ‘home’ CPV cluster is defined as the cluster where the bidder wins the most awards. Entries into other clusters are treated as ‘target’ markets. A bidder–cluster combination is flagged as atypical if the bidder has enough overall history, but that cluster accounts for less than 5% of their awards and at most three wins there. For each bidder–cluster pair, the pipeline computes a Laplace-smoothed probability of observing that bidder in that cluster, p(i|c) = (n_ic + 1) / (n_c + n_suppliers_c), where n_ic is the number of contracts for bidder i in cluster c, n_c is the total number of contracts in c, and n_suppliers_c is the number of distinct suppliers in c. The surprise score is then defined as -log(p(i|c)) and standardised to a z-score; higher values indicate that the bidder is unusually rare in that cluster given its overall structure. These atypical entries are aggregated at the level of home CPV cluster (source) and target CPV cluster (destination). Directed edges are drawn only for pairs of clusters where at least four distinct bidders make such atypical entries. Edge width reflects the number of bidders involved (n_bidders), and edge colour encodes the average standardised surprise score: thicker and redder edges mark flows where many suppliers enter a target market in a way that is collectively more unusual than average. Node positions are based on a force-directed layout (Fruchterman–Reingold) and help visually cluster markets that are strongly connected by unusual flows. When interpreting the figure, focus on CPV clusters with multiple thick, high-surprise edges—either as sources or as targets—as these markets may concentrate non-standard or risky supplier behaviour that merits further qualitative investigation.

Suppliers with Unusually Diversified Market Entries – UY

Each bar represents a supplier flagged as behaving unusually across CPV clusters. For every bidder, the pipeline looks at all atypical CPV entries (those with high surprise z-scores) and records: the maximum surprise z-score attained across these entries, and the number of distinct target markets entered atypically. Bar height shows the maximum surprise z-score, while bar colour (if used) or additional labelling can indicate how many distinct markets each supplier enters in an unusual way. Use this figure to identify suppliers whose market expansion patterns are particularly non-standard and may warrant closer review.

Markets Attracting Unusual Supplier Entries – UY

Each point represents a 3-digit CPV cluster (a broad product or service market). For each target cluster, the pipeline aggregates all atypical entries into that market and computes: (i) the number of distinct suppliers entering atypically (x-axis), (ii) the mean standardised surprise score among those entries (y-axis), and (iii) the number of atypical awards (bubble size). The resulting scatter/bubble plot highlights markets where many suppliers behave unusually (high x), where entries are particularly surprising on average (high y), or both. Clusters in the upper-right with large bubbles are markets that attract many highly atypical entries and may be especially relevant for risk-based follow-up.

Are there suspicious connections between buyers and suppliers?

This analysis examines the potential existence of suspicious connections between buyers and suppliers. It can greatly benefit from access to additional registers—such as politically exposed persons (PEP) lists, interest or asset declaration databases, and beneficial ownership (BO) registers—as these can reveal direct links between suppliers and buyers.

Top Buyers by Supplier Concentration Over Time – UY

This faceted bar chart shows, for each year, the buyers with the highest concentration of spending on a single supplier. The pipeline groups contracts by ‘tender_year’, ‘buyer_masterid’, and ‘bidder_masterid’, computes total spend per buyer–supplier pair, and then derives, for each buyer-year, the share of spending allocated to each supplier (buyer_concentration). For buyers with at least three suppliers, the maximum of this share is used as the concentration indicator, and the top 30 buyer–year combinations per year are retained for plotting. Bars represent the maximum concentration for each buyer in a given year, labels show the number of contracts, and buyers that appear in the top list in more than one year are highlighted in red. High and persistent concentration (especially repeated red buyers across years) may signal stable and potentially problematic dependencies on single suppliers that merit closer investigation.

Is there an observable effect of transparency and integrity issues on prices and competition?

While this analysis does not establish any causal relationships, it can be used to explore correlations between transparency-related measures, prices, and competitiveness.

Predicted Single-Bidding by Missing Share – UY

This figure summarises the estimated relationship between data missingness and the prevalence of single-bid tenders. At the buyer–year level (for years after 2014), the pipeline computes: (i) a cumulative missing-share indicator as the average of all variable-specific missing shares, and (ii) the average share of single-bid tenders using the ‘ind_corr_singleb’ indicator (rescaled to [0,1]). A fractional logit model with year fixed effects regresses the buyer-year single-bidding share on cumulative missingness, controlling for the number of contracts and average contract value, and clustering standard errors by buyer type. The line shows the predicted single-bidding share as missingness increases; the shaded band is the 95% confidence interval. Interpret an upward slope as evidence that poorer transparency and data completeness are associated with less competition, while paying attention to the uncertainty band and the range of missingness actually observed in the data.

Sensitivity Analysis: Single-Bidding Model – UY

This table summarizes the robustness of the single-bidding analysis across different model specifications. The pipeline runs multiple versions of the fractional logit model with varying combinations of: fixed effects (year FE, buyer type FE, or both), clustering (by buyer, buyer type, or none), and control variables (contract count, average value, or both). Each row shows results for one specification, including the coefficient estimate, standard error, p-value, and significance stars. Use this table to verify that the core finding about missingness and single-bidding holds across reasonable modeling choices and is not driven by a single specification.

What is sensitivity analysis?

Sensitivity analysis tests whether the relationship between missing data and the outcome (single-bidding or relative prices) holds across different modeling choices. A robust finding should be consistent across reasonable variations in:

  • Fixed effects (FE): Controls for unobserved differences across buyers, years, etc.
  • Clustering: Adjusts standard errors for correlation within groups
  • Control variables: Additional factors that might affect the outcome

Overall Summary:

n_specs share_positive share_negative median_estimate median_pvalue median_strength share_p_le_0.05 share_p_le_0.1 share_p_le_0.2 share_sign_stable n_nonzero
45 0.667 0.333 0.806 0.344 0.024 0.111 0.178 0.356 0 45

How to interpret:

  • n_specs (45): Number of alternative model specifications tested

  • share_positive (66.7%): Share of models where missingness is associated with a higher outcome

  • share_negative (33.3%): Share of models where missingness is associated with a lower outcome

  • median_estimate (0.806): Typical effect size across models (negative implies a negative association)

  • median_strength (0.024): Median standardized strength (closer to 0 = weaker)

  • median_pvalue (0.344): Median p-value across all specifications

  • share_p_le_0.05 (11.1%): Share of models significant at p ≤ 0.05

  • share_p_le_0.10 (17.8%): Share of models significant at p ≤ 0.10

  • share_p_le_0.20 (35.6%): Share of models significant at p ≤ 0.20 (lenient threshold)

  • ✗ No robust relationship: Results are mixed or highly sensitive to model specification

Show detailed breakdown by model components

By Fixed Effects (FE):

Fixed effects control for unobserved heterogeneity. For example: - Buyer FE: Controls for buyer-specific characteristics (e.g., capacity, procurement culture) - Year FE: Controls for time trends (e.g., economic conditions, regulatory changes) - Buyer + Year FE: Controls for both simultaneously

fe n_specs share_positive share_p10 median_p median_strength
buyer 15 1 0.400 0.168 0.049
year 15 0 0.133 0.344 -0.062
buyer+year 15 1 0.000 0.626 0.024

Look for: Which fixed effect structure gives the most significant results? If results hold across all FE specifications, the finding is very robust.

By Clustering:

Clustering adjusts standard errors for correlation within groups: - Buyer clustering: Accounts for correlation within the same buyer over time - Year clustering: Accounts for common shocks affecting all buyers in a year - None: Assumes all observations are independent (most restrictive)

cluster n_specs share_positive share_p10 median_p median_strength
none 9 0.667 0.556 0.046 0.024
buyer 9 0.667 0.333 0.111 0.024
buyer_buyertype 9 0.667 0.000 0.313 0.024
year 9 0.667 0.000 0.344 0.024
buyer_year 9 0.667 0.000 0.379 0.024

Look for: If results remain significant even with buyer clustering (most conservative), the evidence is strong.

By Controls:

Control variables are additional factors that might affect the outcome: - x_only: No controls (only missingness and fixed effects) - base: Core controls (contract value, buyer type, procedure type) - base_extra: Extended controls (additional economic/organizational factors)

controls n_specs share_positive share_p10 median_p median_strength
base 15 0.667 0.200 0.313 0.024
x_only 15 0.667 0.200 0.313 0.024
base_extra 15 0.667 0.133 0.426 0.024

Look for: Does the relationship hold even with rich controls? If yes, it’s less likely to be driven by omitted variables.

Classification of Model Results:

This table shows how many models fall into each category:

class n share
Negative & significant 2 0.044
Negative but insignificant 13 0.289
Positive & significant 6 0.133
Positive but insignificant 24 0.533

Ideal: Most models should be ‘Positive & significant’. If many are ‘Negative & significant’ or evenly split, the evidence is inconclusive.

Top Model Specifications (by success rate):

These are the combinations of fixed effects, clustering, and controls that most consistently produce significant results:

fe cluster controls n share_pok median_p median_est
year none base 1 1 0.017 -1.735
year none x_only 1 1 0.017 -1.735
buyer none base 1 1 0.046 1.415
buyer none x_only 1 1 0.046 1.415
buyer none base_extra 1 1 0.046 1.415
buyer buyer base 1 1 0.096 1.415
buyer buyer x_only 1 1 0.096 1.415
buyer buyer base_extra 1 1 0.096 1.415
year buyer base 1 0 0.111 -1.735
year buyer x_only 1 0 0.111 -1.735

Look for: Combinations with high share_pok (share of p-values ≤ 0.10). These are the most reliable specification choices for this dataset.


Summary for Decision-Makers:

No robust relationship. Results are mixed or highly sensitive to model specification.

Key statistics: - 67% of models are positive - 33% of models are negative - 18% of models are statistically significant (p ≤ 0.10) - Median effect size: 0.806 - Median p-value: 0.344

Predicted Relative Price by Missing Share – UY

This figure shows how relative prices correlate with overall missingness after controlling for key factors. For contracts with valid price information (bid price and estimated lot value), the pipeline computes ‘relative_price’ as bid_price / lot_estimatedprice and trims extreme or non-positive values. It then constructs ‘total_missing_share’ as the average of NA indicators across a predefined set of core variables, and estimates a linear fixed-effects model of relative_price on total_missing_share, controlling for procedure type, buyer type, and year (with year fixed effects and clustering by buyer). The line displays predicted relative prices as missingness increases, with a shaded 95% confidence interval. An upward relationship suggests that higher missingness – a proxy for weaker transparency and data quality – may be associated with higher prices relative to estimates; a flat line would indicate no clear price effect.

Figure not produced: no information or key variable is missing for the plot.

Sensitivity bundle not available.

Show underlying code
# ========================================================================
# PROCUREMENT INTEGRITY ANALYSIS PIPELINE
# Version: 2.0
# Refactored for consistency, efficiency, and best practices
# ========================================================================

# ========================================================================
# PART 1: CONFIGURATION MANAGEMENT
# ========================================================================

#' Create pipeline configuration
#' 
#' @param country_code Two-letter country code
#' @return List of configuration parameters
create_pipeline_config <- function(country_code) {
  list(
    country = toupper(country_code),
    
    # Analysis thresholds
    thresholds = list(
      min_buyer_contracts = 100,
      min_suppliers_for_buyer_conc = 3,
      min_buyer_years = 3,
      cpv_digits = 3,
      min_bidders_for_edge = 4,
      top_n_buyers = 30,
      top_n_suppliers = 30,
      top_n_markets = 30,
      top_n_vars = 10,
      marginal_share_threshold = 0.05,
      max_wins_atypical = 3,
      min_history_threshold = 4,
      max_relative_price = 5,
      min_relative_price = 0
    ),
    
    # Year filters by component
    years = get_year_range(country_code, "default"),
    years_singleb = get_year_range(country_code, "singleb"),
    years_relprice = get_year_range(country_code, "rel_price"),
    
    # Model settings
    models = list(
      p_max = 0.10,
      fe_set = c("buyer", "year", "buyer+year", "buyer#year"),
      cluster_set = c("none", "buyer", "year", "buyer_year", "buyer_buyertype"),
      controls_set = c("x_only", "base", "base_extra"),
      model_types_relprice = c("ols_level", "ols_log", "gamma_log")
    ),
    
    # Plot settings
    plots = list(
      width = 10,
      height = 6,
      width_large = 12,
      height_large = 12,
      dpi = 300,
      base_size = 14
    )
  )
}

#' Year filter configuration for each country
year_filter_config <- tibble::tribble(
  ~component,  ~country_code, ~min_year, ~max_year,
  # default catch-all
  "default",   "BG",          NA,        NA,
  "default",   "UY",          NA,        NA,
  "default",   "UG",          NA,        NA,
  # component-specific overrides
  "singleb",   "BG",          2011,      2018,
  "singleb",   "UY",          2014,      NA,
  "singleb",   "UG",          NA,        NA,
  "rel_price", "BG",          2011,      2018,
  "rel_price", "UY",          2014,      NA,
  "rel_price", "UG",          NA,        NA
)

#' Get year range for specific component and country
#' 
#' @param country_code Two-letter country code
#' @param component One of "singleb", "rel_price", "default"
#' @return List with min_year and max_year
get_year_range <- function(country_code,
                           component = c("singleb", "rel_price", "default")) {
  component <- match.arg(component)
  cc <- toupper(country_code)
  
  # Try component-specific rule
  row_spec <- year_filter_config %>%
    dplyr::filter(component == !!component, country_code == !!cc) %>%
    dplyr::slice_head(n = 1)
  
  # Fall back to default for that country
  if (nrow(row_spec) == 0) {
    row_spec <- year_filter_config %>%
      dplyr::filter(component == "default", country_code == !!cc) %>%
      dplyr::slice_head(n = 1)
  }
  
  # If still nothing, no filtering
  if (nrow(row_spec) == 0) {
    return(list(min_year = -Inf, max_year = Inf))
  }
  
  min_y <- if (is.na(row_spec$min_year)) -Inf else row_spec$min_year
  max_y <- if (is.na(row_spec$max_year)) Inf else row_spec$max_year
  
  list(min_year = min_y, max_year = max_y)
}

# ========================================================================
# PART 2: LABEL DEFINITIONS AND LOOKUPS
# ========================================================================

#' Variable label lookup for display
label_lookup <- c(
  tender_id_missing_share = "Tender ID",
  tender_year_missing_share = "Tender Year",
  lot_number_missing_share = "Lot Number",
  bid_number_missing_share = "Number of Bids",
  bid_iswinning_missing_share = "Winning Bid",
  tender_country_missing_share = "Tender Country",
  tender_awarddecisiondate_missing_share = "Award Decision Date",
  tender_contractsignaturedate_missing_share = "Contract Signature Date",
  tender_biddeadline_missing_share = "Bid Deadline",
  tender_proceduretype_missing_share = "Procedure Type",
  tender_nationalproceduretype_missing_share = "National Procedure Type",
  tender_supplytype_missing_share = "Supply Type",
  tender_publications_firstcallfortenderdate_missing_share = "First Call for Tender Date",
  notice_url_missing_share = "Notice URL",
  source_missing_share = "Source",
  tender_publications_firstdcontractawarddate_missing_share = "First Contract Award Date",
  tender_publications_lastcontractawardurl_missing_share = "Last Contract Award URL",
  buyer_masterid_missing_share = "Buyer Master ID",
  buyer_id_missing_share = "Buyer ID",
  buyer_city_missing_share = "Buyer City",
  buyer_postcode_missing_share = "Buyer Postcode",
  buyer_country_missing_share = "Buyer Country",
  buyer_nuts_missing_share = "Buyer NUTS",
  buyer_name_missing_share = "Buyer Name",
  buyer_buyertype_missing_share = "Buyer Type",
  buyer_mainactivities_missing_share = "Buyer Main Activities",
  tender_addressofimplementation_country_missing_share = "Implementation Country",
  tender_addressofimplementation_nuts_missing_share = "Implementation NUTS",
  bidder_masterid_missing_share = "Bidder Master ID",
  bidder_id_missing_share = "Bidder ID",
  bidder_country_missing_share = "Bidder Country",
  bidder_nuts_missing_share = "Bidder NUTS",
  bidder_name_missing_share = "Bidder Name",
  bid_price_missing_share = "Bid Price",
  bid_priceusd_missing_share = "Bid Price (USD)",
  bid_pricecurrency_missing_share = "Bid Price Currency",
  lot_productcode_missing_share = "Product Code",
  lot_localproductcode_type_missing_share = "Local Product Code Type",
  lot_localproductcode_missing_share = "Local Product Code",
  submp_missing_share = "Submission Period",
  decp_missing_share = "Decision Period",
  is_capital_missing_share = "Capital City Indicator",
  lot_title_missing_share = "Lot Title",
  lot_estimatedpriceusd_missing_share = "Estimated Price (USD)",
  lot_estimatedprice_missing_share = "Estimated Price",
  lot_estimatedpricecurrency_missing_share = "Estimated Price Currency"
)

#' Apply label lookup to variable names
#' 
#' @param vec Character vector of variable names
#' @param lookup Named character vector of labels
#' @return Character vector with labels applied
label_with_lookup <- function(vec, lookup) {
  out <- lookup[vec]
  out[is.na(out)] <- vec[is.na(out)]
  unname(out)
}

#' Procedure type labels for display
procedure_type_labels <- c(
  "OPEN" = "Open (competitive bidding)",
  "RESTRICTED" = "Restricted (limited competition)",
  "NEGOTIATED_WITH_PUBLICATION" = "Negotiated with publication (limited competition)",
  "NEGOTIATED_WITHOUT_PUBLICATION" = "Negotiated without publication (no competition)",
  "NEGOTIATED" = "Negotiated (limited competition)",
  "COMPETITIVE_DIALOG" = "Competitive dialogue (limited competition)",
  "OUTRIGHT_AWARD" = "Outright award (direct purchase)",
  "OTHER" = "Other (special or exceptional procedures)",
  "Missing procedure type" = "Missing procedure type"
)

# ========================================================================
# PART 2A: STANDARD PLOT THEME SETTINGS
# ========================================================================

#' Standard text sizes for all plots
#' These are calibrated for fig.width=14, fig.height=12
PLOT_SIZES <- list(
  base_size = 16,
  title_size = 20,
  subtitle_size = 14,
  axis_title_size = 16,
  axis_text_size = 14,
  legend_title_size = 14,
  legend_text_size = 13,
  geom_text_size = 4,      # For geom_text (percentages in tiles, bar labels)
  geom_text_large = 5,     # For larger labels
  line_size = 1.5,
  point_size = 3
)

#' Apply standard theme to a ggplot object
#' 
#' @param base_size Base font size
#' @return ggplot2 theme
standard_plot_theme <- function(base_size = PLOT_SIZES$base_size) {
  ggplot2::theme_minimal(base_size = base_size) +
    ggplot2::theme(
      plot.title = ggplot2::element_text(size = PLOT_SIZES$title_size, face = "bold"),
      plot.subtitle = ggplot2::element_text(size = PLOT_SIZES$subtitle_size),
      axis.title = ggplot2::element_text(size = PLOT_SIZES$axis_title_size),
      axis.text = ggplot2::element_text(size = PLOT_SIZES$axis_text_size),
      legend.title = ggplot2::element_text(size = PLOT_SIZES$legend_title_size),
      legend.text = ggplot2::element_text(size = PLOT_SIZES$legend_text_size)
    )
}

# ========================================================================
# PART 3: DATA LOADING AND VALIDATION
# ========================================================================

#' Load data with consistent settings
#' 
#' @param input_path Path to input file
#' @return data.table
load_data <- function(input_path) {
  data <- data.table::fread(
    input = input_path,
    keepLeadingZeros = TRUE,
    encoding = "UTF-8",
    stringsAsFactors = FALSE,
    showProgress = TRUE,
    na.strings = c("", "-", "NA")
  )
  
  # Drop duplicated column names
  dup_cols <- duplicated(names(data))
  if (any(dup_cols)) {
    warning("Dropping ", sum(dup_cols), " duplicated columns")
    data <- data[, !dup_cols, with = FALSE]
  }
  
  return(data)
}

#' Validate required columns exist
#' 
#' @param df Data frame to check
#' @param required_cols Character vector of required column names
#' @param action_name Name of action requiring columns (for messages)
#' @return Logical indicating whether all columns exist
validate_required_columns <- function(df, required_cols, action_name = "this action") {
  missing <- setdiff(required_cols, names(df))
  
  if (length(missing) > 0) {
    message(sprintf(
      "Skipping %s: missing columns [%s]",
      action_name,
      paste(missing, collapse = ", ")
    ))
    return(FALSE)
  }
  TRUE
}

#' Check overall data quality
#' 
#' @param df Data frame to check
#' @param config Configuration list
#' @return List of quality metrics
check_data_quality <- function(df, config) {
  list(
    n_rows = nrow(df),
    n_cols = ncol(df),
    has_buyer_id = "buyer_masterid" %in% names(df),
    has_bidder_id = "bidder_masterid" %in% names(df),
    has_tender_year = "tender_year" %in% names(df),
    has_price = "bid_price" %in% names(df) | "bid_priceusd" %in% names(df),
    year_range = if ("tender_year" %in% names(df)) {
      range(df$tender_year, na.rm = TRUE)
    } else {
      NULL
    },
    n_unique_buyers = if ("buyer_masterid" %in% names(df)) {
      dplyr::n_distinct(df$buyer_masterid, na.rm = TRUE)
    } else {
      NA_integer_
    },
    n_unique_bidders = if ("bidder_masterid" %in% names(df)) {
      dplyr::n_distinct(df$bidder_masterid, na.rm = TRUE)
    } else {
      NA_integer_
    }
  )
}

# ========================================================================
# PART 4: DATA PREPARATION HELPERS
# ========================================================================

#' Add tender year from available date fields
#' 
#' @param df Data frame
#' @return Data frame with tender_year column
add_tender_year <- function(df) {
  df %>%
    dplyr::mutate(
      tender_year = dplyr::coalesce(
        stringr::str_extract(tender_publications_firstcallfortenderdate, "^\\d{4}"),
        stringr::str_extract(tender_awarddecisiondate, "^\\d{4}"),
        stringr::str_extract(tender_biddeadline, "^\\d{4}")
      ),
      tender_year = as.integer(tender_year)
    )
}

#' Add buyer group classification
#' 
#' @param buyer_buyertype Character vector of buyer types
#' @return Factor with buyer groups
add_buyer_group <- function(buyer_buyertype) {
  group <- dplyr::case_when(
    grepl("(?i)national", buyer_buyertype) ~ "National Buyer",
    grepl("(?i)regional", buyer_buyertype) ~ "Regional Buyer",
    grepl("(?i)utilities", buyer_buyertype) ~ "Utilities",
    grepl("(?i)European", buyer_buyertype) ~ "EU agency",
    TRUE ~ "Other Public Bodies"
  )
  
  factor(
    group,
    levels = c(
      "National Buyer",
      "Regional Buyer",
      "Utilities",
      "EU agency",
      "Other Public Bodies"
    )
  )
}

#' Standardize missing values
#' 
#' @param df Data frame
#' @return Data frame with standardized NAs
standardize_missing_values <- function(df) {
  df %>%
    dplyr::mutate(
      dplyr::across(
        dplyr::where(is.character),
        ~ dplyr::na_if(., "")
      )
    )
}

#' Add commonly used derived variables
#' 
#' @param df Data frame
#' @return Data frame with derived variables
add_derived_variables <- function(df) {
  df <- df %>%
    dplyr::mutate(
      has_buyer_id = !is.na(buyer_masterid),
      has_bidder_id = !is.na(bidder_masterid),
      has_price = !is.na(bid_price) | !is.na(bid_priceusd)
    )
  
  # Add buyer group if buyertype exists
  if ("buyer_buyertype" %in% names(df)) {
    df <- df %>%
      dplyr::mutate(buyer_group = add_buyer_group(buyer_buyertype))
  }
  
  df
}

#' Clean NUTS3 codes
#' 
#' @param df Data frame
#' @param nuts_col Column name containing NUTS codes
#' @return Data frame with cleaned nuts3 column
clean_nuts3 <- function(df, nuts_col = buyer_nuts) {
  nuts_quo <- rlang::enquo(nuts_col)
  
  df %>%
    dplyr::mutate(
      buyer_nuts = as.character(!!nuts_quo),
      nuts_clean = buyer_nuts %>%
        stringr::str_replace_all("\\[|\\]", "") %>%
        stringr::str_replace_all('"', "") %>%
        stringr::str_squish(),
      nuts3 = dplyr::if_else(
        stringr::str_detect(nuts_clean, "^[A-Z]{2}[0-9]{3}$"),
        nuts_clean,
        NA_character_
      )
    )
}

#' Prepare data for analysis
#' 
#' @param df Data frame
#' @return Prepared data frame
prepare_data <- function(df) {
  df %>%
    add_tender_year() %>%
    standardize_missing_values() %>%
    add_derived_variables()
}

# ========================================================================
# PART 5: MISSING VALUE ANALYSIS
# ========================================================================

#' Compute missing shares across columns
#' 
#' @param df Data frame
#' @param cols Columns to analyze (tidy-select)
#' @return Data frame with missing share columns
#' Compute missing shares across columns
summarise_missing_shares <- function(df, cols = !dplyr::starts_with("ind_")) {
  df %>%
    dplyr::summarise(
      dplyr::across(
        .cols = {{ cols }} & !ends_with("_missing_share"),  # Remove dplyr:: prefix from ends_with
        .fns = ~ mean(is.na(.)),
        .names = "{.col}_missing_share"
      ),
      .groups = "drop"
    )
}

#' Pivot missing shares to long format
#' 
#' @param df Data frame with missing share columns
#' @param id_vars ID columns to keep
#' @return Long format data frame
pivot_missing_long <- function(df, id_vars = NULL) {
  if (is.null(id_vars)) {
    df %>%
      tidyr::pivot_longer(
        cols = dplyr::everything(),
        names_to = "variable",
        values_to = "missing_share"
      )
  } else {
    df %>%
      tidyr::pivot_longer(
        cols = -dplyr::all_of(id_vars),
        names_to = "variable",
        values_to = "missing_share"
      )
  }
}

#' Compute organization-level missing shares for interoperability
#' 
#' @param df Data frame
#' @return Data frame with organization type and missing shares
compute_org_missing <- function(df) {
  miss_or_na <- function(col) {
    if (col %in% names(df)) {
      mean(is.na(df[[col]]))
    } else {
      NA_real_
    }
  }
  
  tibble::tribble(
    ~organization_type, ~id_type, ~missing_share,
    "Supplier", "Source ID", miss_or_na("bidder_id"),
    "Supplier", "Generated ID", miss_or_na("bidder_masterid"),
    "Supplier", "Name", miss_or_na("bidder_name"),
    "Supplier", "Address (postcode)", miss_or_na("bidder_nuts"),
    "Buyer", "Source ID", miss_or_na("buyer_id"),
    "Buyer", "Generated ID", miss_or_na("buyer_masterid"),
    "Buyer", "Name", miss_or_na("buyer_name"),
    "Buyer", "Address (postcode)", miss_or_na("buyer_postcode")
  )
}

#' Compute missing correlations
#' 
#' @param df Data frame
#' @return Correlation matrix
compute_missing_correlations <- function(df) {
  miss_matrix <- df %>%
    dplyr::select(!dplyr::starts_with("ind_")) %>%
    dplyr::mutate(dplyr::across(dplyr::everything(), ~ as.numeric(is.na(.))))
  
  corrr::correlate(miss_matrix)
}

# ========================================================================
# PART 6: PLOTTING FUNCTIONS
# ========================================================================

#' Create plot saver function
#' 
#' @param output_dir Output directory
#' @param config Configuration list
#' @return Function to save plots
create_plot_saver <- function(output_dir, config) {
  function(plot, filename, width = NULL, height = NULL) {
    if (is.null(plot)) return(invisible(NULL))
    
    ggplot2::ggsave(
      filename = file.path(output_dir, paste0(filename, "_", config$country, ".png")),
      plot = plot,
      width = width %||% config$plots$width,
      height = height %||% config$plots$height,
      dpi = config$plots$dpi
    )
    
    invisible(TRUE)
  }
}

#' Plot missing share bar chart
#' 
#' @param data_long Long format data with variable and missing_share
#' @param label_lookup Named vector for labels
#' @param title Plot title
#' @return ggplot object
plot_missing_bar <- function(data_long, label_lookup, title = "Missing Value Share per Variable") {
  ggplot2::ggplot(
    data_long,
    ggplot2::aes(
      x = reorder(variable, missing_share),
      y = missing_share
    )
  ) +
    ggplot2::geom_col(fill = "lightblue") +
    ggplot2::coord_flip() +
    ggplot2::scale_x_discrete(
      labels = function(x) label_with_lookup(x, label_lookup)
    ) +
    ggplot2::scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    ggplot2::labs(
      title = title,
      x = "Variable",
      y = "Missing Share"
    ) +
    standard_plot_theme()
}

#' Generic missing share heatmap with dynamic sizing
#'
#' @param data Data frame
#' @param x_var Name of x variable
#' @param y_var Name of y variable
#' @param fill_var Name of fill variable
#' @param title Plot title
#' @param x_lab X-axis label
#' @param y_lab Y-axis label
#' @param height_per_row Height per y-axis item (default 0.4)
#' @param text_size Size of percentage text in cells (default 3.5)
#' @return ggplot object with suggested_height attribute
plot_missing_heatmap <- function(data, x_var, y_var, fill_var,
                                 title, x_lab, y_lab,
                                 height_per_row = 0.3,
                                 text_size = NULL) {
  x_sym <- rlang::sym(x_var)
  y_sym <- rlang::sym(y_var)
  fill_sym <- rlang::sym(fill_var)
  
  # Use standard size or adjust based on data
  if (is.null(text_size)) {
    n_y_values <- length(unique(data[[y_var]]))
    if (n_y_values > 40) {
      text_size <- PLOT_SIZES$geom_text_size - 0.5
    } else if (n_y_values > 30) {
      text_size <- PLOT_SIZES$geom_text_size - 0.2
    } else {
      text_size <- PLOT_SIZES$geom_text_size
    }
  }
  
  p <- ggplot2::ggplot(
    data,
    ggplot2::aes(
      x = !!x_sym,
      y = !!y_sym,
      fill = !!fill_sym
    )
  ) +
    ggplot2::geom_tile(color = "white") +
    ggplot2::geom_text(
      ggplot2::aes(label = scales::percent(!!fill_sym, accuracy = 1)),
      size = text_size,
      color = "black"
    ) +
    ggplot2::scale_fill_gradient(
      name = "Missing share",
      labels = scales::percent_format(accuracy = 1),
      limits = c(0, 1),
      low = "skyblue1",
      high = "indianred1"
    ) +
    ggplot2::labs(
      title = title,
      x = x_lab,
      y = y_lab
    ) +
    standard_plot_theme() +
    ggplot2::theme(
      axis.text.x = ggplot2::element_text(angle = 45, hjust = 1)
    )
  
  p
}
#' Plot top N bar chart
#' 
#' @param df Data frame
#' @param x_var Name of x variable
#' @param y_var Name of y variable
#' @param label_var Name of label variable
#' @param title Plot title
#' @param x_lab X-axis label
#' @param y_lab Y-axis label
#' @param fill_color Bar fill color
#' @param y_limit Y-axis limits
#' @param percent Whether to format labels as percent
#' @return ggplot object
plot_top_bar <- function(df, x_var, y_var, label_var,
                         title, x_lab, y_lab,
                         fill_color = "skyblue1",
                         y_limit = c(0, 1.05),
                         percent = TRUE) {
  x_sym <- rlang::sym(x_var)
  y_sym <- rlang::sym(y_var)
  lab_sym <- rlang::sym(label_var)
  
  p <- ggplot2::ggplot(
    df,
    ggplot2::aes(
      x = reorder(!!x_sym, !!y_sym),
      y = !!y_sym
    )
  ) +
    ggplot2::geom_col(fill = fill_color) +
    ggplot2::geom_text(
      ggplot2::aes(
        label = if (percent) {
          scales::percent(!!y_sym, accuracy = 0.1)
        } else {
          !!lab_sym
        }
      ),
      hjust = -0.1,
      size = PLOT_SIZES$geom_text_large,
      check_overlap = TRUE
    ) +
    ggplot2::coord_flip() +
    ggplot2::labs(
      title = title,
      x = x_lab,
      y = y_lab
    ) +
    ggplot2::theme_bw(base_size = PLOT_SIZES$base_size) +
    ggplot2::theme(
      plot.title = ggplot2::element_text(size = PLOT_SIZES$title_size, face = "bold"),
      axis.title = ggplot2::element_text(size = PLOT_SIZES$axis_title_size),
      axis.text = ggplot2::element_text(size = PLOT_SIZES$axis_text_size)
    )
  
  if (!is.null(y_limit)) {
    p <- p + ggplot2::ylim(y_limit)
  }
  
  p
}

#' Plot ggeffects line with confidence band
#' 
#' @param pred ggeffects prediction object
#' @param title Plot title
#' @param subtitle Plot subtitle
#' @param x_lab X-axis label
#' @param y_lab Y-axis label
#' @param caption Plot caption
#' @param wrap_width Width for text wrapping
#' @return ggplot object
plot_ggeffects_line <- function(pred,
                                title,
                                subtitle,
                                x_lab,
                                y_lab,
                                caption = NULL,
                                wrap_width = 110) {
  
  subtitle_w <- if (!is.null(subtitle)) stringr::str_wrap(subtitle, width = wrap_width) else NULL
  caption_w <- if (!is.null(caption)) stringr::str_wrap(caption, width = wrap_width) else NULL
  
  ggplot2::ggplot(pred, ggplot2::aes(x, predicted)) +
    ggplot2::geom_line(size = PLOT_SIZES$line_size, color = "lightblue") +
    ggplot2::geom_ribbon(ggplot2::aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
    ggplot2::labs(
      title = title,
      subtitle = subtitle_w,
      x = x_lab,
      y = y_lab,
      caption = caption_w
    ) +
    standard_plot_theme() +
    ggplot2::theme(
      plot.subtitle = ggplot2::element_text(size = PLOT_SIZES$subtitle_size, lineheight = 1.05),
      plot.caption = ggplot2::element_text(size = PLOT_SIZES$subtitle_size - 2, lineheight = 1.05),
      plot.margin = ggplot2::margin(10, 18, 10, 10)
    )
}
# ========================================================================
# PART 7: MODEL SPECIFICATION HELPERS
# ========================================================================

#' Build fixed effects formula part
#' 
#' @param fe Fixed effects specification
#' @return Character string for formula
make_fe_part <- function(fe) {
  switch(
    fe,
    "0" = "0",
    "buyer" = "buyer_masterid",
    "year" = "tender_year",
    "buyer+year" = "buyer_masterid + tender_year",
    "buyer#year" = "buyer_masterid^tender_year",
    stop("Unknown FE spec: ", fe)
  )
}

#' Build cluster formula
#' 
#' @param cluster Cluster specification
#' @return Formula or NULL
make_cluster <- function(cluster) {
  switch(
    cluster,
    "none" = NULL,
    "buyer" = stats::as.formula("~ buyer_masterid"),
    "year" = stats::as.formula("~ tender_year"),
    "buyer_year" = stats::as.formula("~ buyer_masterid + tender_year"),
    "buyer_buyertype" = stats::as.formula("~ buyer_masterid + buyer_buyertype"),
    stop("Unknown cluster spec: ", cluster)
  )
}

#' Safe fixest model fitting
#' 
#' @param expr Expression to evaluate
#' @return Model or NULL
safe_fixest <- function(expr) {
  tryCatch(expr, error = function(e) NULL)
}

#' Extract effect from fixest model
#' 
#' @param model fixest model object
#' @param x_name Name of variable to extract
#' @param data_used Data used for estimation
#' @param y_name Name of outcome variable
#' @return List with estimate, pvalue, nobs, std_slope
extract_effect_fixest <- function(model, x_name, data_used, y_name = NULL) {
  s <- summary(model)
  ct <- s$coeftable
  
  if (!(x_name %in% rownames(ct))) {
    return(list(
      estimate = NA_real_,
      pvalue = NA_real_,
      nobs = s$nobs,
      std_slope = NA_real_
    ))
  }
  
  est <- as.numeric(ct[x_name, "Estimate"])
  pv <- as.numeric(ct[x_name, "Pr(>|t|)"])
  if (is.na(pv) && "Pr(>|z|)" %in% colnames(ct)) {
    pv <- as.numeric(ct[x_name, "Pr(>|z|)"])
  }
  
  sx <- stats::sd(data_used[[x_name]], na.rm = TRUE)
  std_slope <- est * sx
  
  list(
    estimate = est,
    pvalue = pv,
    nobs = s$nobs,
    std_slope = std_slope
  )
}

#' Extract effect with specific vcov
#' 
#' @param model fixest model
#' @param x_name Variable name
#' @param data_used Data frame
#' @param vcov VCOV specification
#' @return List with estimate, se, t, p
extract_effect_fixest_vcov <- function(model, x_name, data_used, vcov) {
  ct <- tryCatch(
    fixest::coeftable(model, vcov = vcov),
    error = function(e) NULL
  )
  
  if (is.null(ct) || !(x_name %in% rownames(ct))) {
    return(list(estimate = NA_real_, se = NA_real_, t = NA_real_, p = NA_real_))
  }
  
  est <- as.numeric(ct[x_name, "Estimate"])
  se <- as.numeric(ct[x_name, "Std. Error"])
  
  tv <- if ("t value" %in% colnames(ct)) as.numeric(ct[x_name, "t value"]) else NA_real_
  pv <- if ("Pr(>|t|)" %in% colnames(ct)) as.numeric(ct[x_name, "Pr(>|t|)"]) else
    if ("Pr(>|z|)" %in% colnames(ct)) as.numeric(ct[x_name, "Pr(>|z|)"]) else NA_real_
  
  list(estimate = est, se = se, t = tv, p = pv)
}

#' Compute effect at P10 vs P90
#' 
#' @param model Model object
#' @param data_used Data frame
#' @param x_name Variable name
#' @return Numeric effect size
effect_p10_p90 <- function(model, data_used, x_name) {
  qs <- stats::quantile(data_used[[x_name]], probs = c(.1, .9), na.rm = TRUE)
  x_lo <- unname(qs[1])
  x_hi <- unname(qs[2])
  
  typical <- data_used[1, , drop = FALSE]
  for (nm in names(typical)) {
    if (nm == x_name) next
    v <- data_used[[nm]]
    if (is.numeric(v)) {
      typical[[nm]] <- stats::median(v, na.rm = TRUE)
    } else if (is.factor(v) || is.character(v)) {
      tab <- sort(table(v), decreasing = TRUE)
      typical[[nm]] <- names(tab)[1]
      if (is.factor(v)) typical[[nm]] <- factor(typical[[nm]], levels = levels(v))
    }
  }
  
  d_lo <- typical
  d_hi <- typical
  d_lo[[x_name]] <- x_lo
  d_hi[[x_name]] <- x_hi
  
  p_lo <- suppressWarnings(stats::predict(model, newdata = d_lo, type = "response"))
  p_hi <- suppressWarnings(stats::predict(model, newdata = d_hi, type = "response"))
  
  as.numeric(p_hi - p_lo)
}

#' Get default VCOV menu for robustness checks
#' 
#' @param has_buyertype Whether buyer type variable exists
#' @return List of VCOV specifications
get_default_vcov_menu <- function(has_buyertype = TRUE) {
  out <- list(
    "hetero",
    stats::as.formula("~ buyer_masterid"),
    stats::as.formula("~ tender_year"),
    stats::as.formula("~ buyer_masterid + tender_year")
  )
  if (has_buyertype) {
    out <- c(out, list(stats::as.formula("~ buyer_masterid + buyer_buyertype")))
  }
  out
}

#' Compute robustness summary
#' 
#' @param model Model object
#' @param x_name Variable name
#' @param data_used Data frame
#' @param vcov_list List of VCOV specs
#' @param p_max P-value threshold
#' @return List with robustness metrics
robustness_summary <- function(model, x_name, data_used, vcov_list, p_max = 0.10) {
  res <- lapply(vcov_list, function(v) extract_effect_fixest_vcov(model, x_name, data_used, v))
  
  pvals <- vapply(res, function(z) z$p, numeric(1))
  ests <- vapply(res, function(z) z$estimate, numeric(1))
  
  worst_p <- suppressWarnings(max(pvals, na.rm = TRUE))
  pass_count <- sum(!is.na(pvals) & pvals <= p_max)
  tested_count <- sum(!is.na(pvals))
  pass_share <- if (tested_count > 0) pass_count / tested_count else NA_real_
  
  sign_consistent <- {
    s <- sign(ests[!is.na(ests)])
    length(unique(s)) <= 1
  }
  
  list(
    robust_p_worst = if (is.infinite(worst_p)) NA_real_ else worst_p,
    robust_pass_count = pass_count,
    robust_tested = tested_count,
    robust_pass_share = pass_share,
    robust_sign_stable = sign_consistent
  )
}

# ========================================================================
# PART 8: MODEL SELECTION FUNCTIONS
# ========================================================================

#' Pick best model from specifications
#' 
#' @param results_df Data frame of model results
#' @param require_positive Whether to require positive estimate
#' @param p_max Maximum p-value
#' @param strength_col Column name for effect strength
#' @return Single row data frame or NULL
pick_best_model <- function(results_df,
                            require_positive = TRUE,
                            p_max = 0.10,
                            strength_col = c("effect_strength", "std_slope")) {
  strength_col <- match.arg(strength_col)
  
  df <- results_df
  if (require_positive) df <- df[df$estimate > 0, , drop = FALSE]
  df <- df[!is.na(df$pvalue) & df$pvalue <= p_max, , drop = FALSE]
  df <- df[!is.na(df[[strength_col]]), , drop = FALSE]
  if (nrow(df) == 0) return(NULL)
  
  df <- df[order(df[[strength_col]], decreasing = TRUE), , drop = FALSE]
  df[["rank"]] <- seq_len(nrow(df))
  df[1, , drop = FALSE]
}

#' Pick most robust model
#' 
#' @param results_df Data frame with robustness metrics
#' @param require_positive Require positive estimate
#' @param p_max P-value threshold
#' @param strength_col Column for effect strength
#' @return Single row or NULL
pick_most_robust_model <- function(results_df,
                                   require_positive = TRUE,
                                   p_max = 0.10,
                                   strength_col = c("effect_strength", "std_slope")) {
  strength_col <- match.arg(strength_col)
  
  df <- results_df
  if (require_positive) df <- df[df$estimate > 0, , drop = FALSE]
  
  df <- df[!is.na(df$robust_p_worst), , drop = FALSE]
  df <- df[df$robust_p_worst <= p_max, , drop = FALSE]
  df <- df[!is.na(df[[strength_col]]), , drop = FALSE]
  
  if (nrow(df) == 0) return(NULL)
  
  df <- df[order(
    -df$robust_pass_share,
    df$robust_p_worst,
    -df[[strength_col]]
  ), , drop = FALSE]
  
  df[1, , drop = FALSE]
}

#' Model diagnostics
#' 
#' @param specs_df Specification results
#' @param require_positive Require positive estimate
#' @param p_max P-value threshold
#' @param strength_col Strength column name
#' @return List of diagnostic counts
model_diagnostics <- function(specs_df, require_positive = TRUE, p_max = 0.10, strength_col) {
  if (is.null(specs_df) || nrow(specs_df) == 0L) {
    return(list(total = 0L, after_sign = 0L, after_p = 0L, after_strength = 0L))
  }
  
  df <- specs_df
  total <- nrow(df)
  
  df1 <- if (require_positive) df[df$estimate > 0, , drop = FALSE] else df
  after_sign <- nrow(df1)
  
  df2 <- df1[!is.na(df1$pvalue) & df1$pvalue <= p_max, , drop = FALSE]
  after_p <- nrow(df2)
  
  ok_strength <- !is.na(df2[[strength_col]])
  df3 <- df2[ok_strength, , drop = FALSE]
  after_strength <- nrow(df3)
  
  list(total = total, after_sign = after_sign, after_p = after_p, after_strength = after_strength)
}

# ========================================================================
# PART 9: SENSITIVITY ANALYSIS FUNCTIONS
# ========================================================================

#' Add strength column to specifications
#' 
#' @param specs Specifications data frame
#' @return Data frame with strength column
add_strength_column <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(specs)
  
  if ("effect_strength" %in% names(specs)) {
    specs$strength <- specs$effect_strength
  } else if ("std_slope" %in% names(specs)) {
    specs$strength <- specs$std_slope
  } else {
    specs$strength <- NA_real_
  }
  specs
}

#' Summarize sensitivity overall
#' 
#' @param specs Specifications data frame
#' @param p_levels P-value levels to check
#' @return Summary tibble
summarise_sensitivity_overall <- function(specs, p_levels = c(0.05, 0.10, 0.20)) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs <- add_strength_column(specs)
  
  tibble::tibble(
    n_specs = nrow(specs),
    share_positive = mean(specs$estimate > 0, na.rm = TRUE),
    share_negative = mean(specs$estimate < 0, na.rm = TRUE),
    median_estimate = median(specs$estimate, na.rm = TRUE),
    median_pvalue = median(specs$pvalue, na.rm = TRUE),
    median_strength = median(specs$strength, na.rm = TRUE),
    !!!setNames(
      lapply(p_levels, function(p) mean(specs$pvalue <= p, na.rm = TRUE)),
      paste0("share_p_le_", p_levels)
    )
  )
}

#' Summarize sign instability
#' 
#' @param specs Specifications data frame
#' @return Summary tibble
summarise_sign_instability <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  s <- sign(specs$estimate)
  s <- s[!is.na(s) & s != 0]
  tibble::tibble(
    share_sign_stable = if (length(s) == 0) NA_real_ else as.numeric(length(unique(s)) <= 1),
    n_nonzero = length(s)
  )
}

#' Summarize by fixed effects
#' 
#' @param specs Specifications data frame
#' @return Summary tibble
summarise_by_fe <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs <- add_strength_column(specs)
  
  specs %>%
    dplyr::group_by(fe) %>%
    dplyr::summarise(
      n_specs = dplyr::n(),
      share_positive = mean(estimate > 0, na.rm = TRUE),
      share_p10 = mean(pvalue <= 0.10, na.rm = TRUE),
      median_p = median(pvalue, na.rm = TRUE),
      median_strength = median(strength, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::arrange(dplyr::desc(share_p10))
}

#' Summarize by cluster
#' 
#' @param specs Specifications data frame
#' @return Summary tibble
summarise_by_cluster <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs <- add_strength_column(specs)
  
  specs %>%
    dplyr::group_by(cluster) %>%
    dplyr::summarise(
      n_specs = dplyr::n(),
      share_positive = mean(estimate > 0, na.rm = TRUE),
      share_p10 = mean(pvalue <= 0.10, na.rm = TRUE),
      median_p = median(pvalue, na.rm = TRUE),
      median_strength = median(strength, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::arrange(median_p)
}

#' Summarize by controls
#' 
#' @param specs Specifications data frame
#' @return Summary tibble
summarise_by_controls <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs <- add_strength_column(specs)
  
  specs %>%
    dplyr::group_by(controls) %>%
    dplyr::summarise(
      n_specs = dplyr::n(),
      share_positive = mean(estimate > 0, na.rm = TRUE),
      share_p10 = mean(pvalue <= 0.10, na.rm = TRUE),
      median_p = median(pvalue, na.rm = TRUE),
      median_strength = median(strength, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::arrange(dplyr::desc(share_p10))
}

#' Classify specifications
#' 
#' @param specs Specifications data frame
#' @param p_cut P-value cutoff
#' @return Classification tibble
classify_specs <- function(specs, p_cut = 0.10) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs %>%
    dplyr::mutate(
      class = dplyr::case_when(
        estimate > 0 & pvalue <= p_cut ~ "Positive & significant",
        estimate > 0 ~ "Positive but insignificant",
        estimate < 0 & pvalue <= p_cut ~ "Negative & significant",
        estimate < 0 ~ "Negative but insignificant",
        TRUE ~ "Missing/NA"
      )
    ) %>%
    dplyr::count(class) %>%
    dplyr::mutate(share = n / sum(n))
}

#' Find top specification cells
#' 
#' @param specs Specifications data frame
#' @param p_cut P-value cutoff
#' @param n_top Number of top cells
#' @return Top cells tibble
top_cells <- function(specs, p_cut = 0.10, n_top = 10) {
  if (is.null(specs) || nrow(specs) == 0L) return(tibble::tibble())
  specs %>%
    dplyr::mutate(p_ok = pvalue <= p_cut) %>%
    dplyr::group_by(fe, cluster, controls) %>%
    dplyr::summarise(
      n = dplyr::n(),
      share_pok = mean(p_ok, na.rm = TRUE),
      median_p = median(pvalue, na.rm = TRUE),
      median_est = median(estimate, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::arrange(dplyr::desc(share_pok), median_p) %>%
    dplyr::slice_head(n = n_top)
}

#' Build sensitivity bundle
#' 
#' @param specs Specifications data frame
#' @return List of sensitivity summaries
build_sensitivity_bundle <- function(specs) {
  if (is.null(specs) || nrow(specs) == 0L) return(list())
  specs <- add_strength_column(specs)
  
  list(
    overall = summarise_sensitivity_overall(specs),
    sign = summarise_sign_instability(specs),
    by_fe = summarise_by_fe(specs),
    by_cluster = summarise_by_cluster(specs),
    by_controls = summarise_by_controls(specs),
    classes = classify_specs(specs),
    top_cells = top_cells(specs)
  )
}

# ========================================================================
# PART 10: DISPLAY HELPERS
# ========================================================================

#' Pretty model name
#' 
#' @param model_type Model type string
#' @return Readable name
pretty_model_name <- function(model_type) {
  switch(
    model_type,
    "fractional_logit" = "Fractional Logit (quasi-binomial, logit link)",
    "ols_level" = "OLS (levels)",
    "ols_log" = "OLS (log-transformed outcome)",
    "gamma_log" = "Gamma GLM (log link)",
    model_type
  )
}

#' Pretty controls label
#' 
#' @param ctrl Controls specification
#' @return Readable label
pretty_controls_label <- function(ctrl) {
  switch(
    ctrl,
    "x_only" = "No controls",
    "base" = "Core controls",
    "base_extra" = "Core + extra controls",
    ctrl
  )
}

#' Pretty FE label
#' 
#' @param fe Fixed effects specification
#' @return Readable label
pretty_fe_label <- function(fe) {
  switch(
    fe,
    "0" = "No fixed effects",
    "buyer" = "Buyer FE",
    "year" = "Year FE",
    "buyer+year" = "Buyer + Year FE",
    "buyer#year" = "Buyer×Year FE",
    fe
  )
}

#' Controls note for captions
#' 
#' @param ctrl Controls specification
#' @return Note string
controls_note <- function(ctrl) {
  switch(
    ctrl,
    "x_only" = "Controls: none (missingness only).",
    "base" = "Controls: log(contract value), buyer type, procedure type.",
    "base_extra" = "Controls: log(contracts), log(avg contract value), log(total value), buyer type.",
    paste0("Controls: ", ctrl, ".")
  )
}

#' FE counts note
#' 
#' @param data Estimation data
#' @param fe Fixed effects specification
#' @return Note string
fe_counts_note <- function(data, fe) {
  if (is.null(data) || nrow(data) == 0) return("FE counts: N/A")
  
  if (fe == "buyer") {
    return(paste0("FE groups: buyers=", dplyr::n_distinct(data$buyer_masterid)))
  }
  if (fe == "year") {
    return(paste0("FE groups: years=", dplyr::n_distinct(data$tender_year)))
  }
  if (fe == "buyer+year") {
    return(paste0(
      "FE groups: buyers=", dplyr::n_distinct(data$buyer_masterid),
      ", years=", dplyr::n_distinct(data$tender_year)
    ))
  }
  if (fe == "buyer#year") {
    return(paste0(
      "FE groups: buyer×year=",
      dplyr::n_distinct(paste(data$buyer_masterid, data$tender_year, sep = "_"))
    ))
  }
  "FE counts: N/A"
}

# ========================================================================
# PART 11: MODULE - MISSING VALUE ANALYSIS (with dynamic heights)
# ========================================================================

#' Analyze missing values
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_missing_values <- function(df, config, output_dir) {
  results <- list()
  save_plot <- create_plot_saver(output_dir, config)
  
  # Overall missing shares
  missing_shares <- df %>%
    summarise_missing_shares(cols = !dplyr::starts_with("ind_"))
  
  results$overall <- missing_shares
  results$overall_long <- pivot_missing_long(missing_shares)
  
  results$overall_plot <- plot_missing_bar(
    data_long = results$overall_long,
    label_lookup = label_lookup,
    title = paste("Missing Value Share per Variable –", config$country)
  )
  save_plot(results$overall_plot, "missing_shares", width = 10, height = 8)  # FIXED HEIGHT
  
  # By buyer type
  if (validate_required_columns(df, "buyer_buyertype", "missing by buyer type")) {
    results$by_buyer <- df %>%
      dplyr::mutate(buyer_group = add_buyer_group(buyer_buyertype)) %>%
      dplyr::group_by(buyer_group) %>%
      summarise_missing_shares(cols = -dplyr::starts_with("ind_")) %>%
      pivot_missing_long(id_vars = "buyer_group") %>%
      dplyr::mutate(variable_label = label_with_lookup(variable, label_lookup))
    
    results$by_buyer_plot <- plot_missing_heatmap(
      data = results$by_buyer,
      x_var = "buyer_group",
      y_var = "variable_label",
      fill_var = "missing_share",
      title = paste("Missing value share per variable by buyer type –", config$country),
      x_lab = "Buyer type",
      y_lab = "Variable",
      height_per_row = 0.35,
      text_size = 3
    )
  }
  
  # Top buyers with highest missing share
  # Top buyers with highest missing share
  if (validate_required_columns(df, c("buyer_masterid", "buyer_buyertype"), "top missing buyers")) {
    results$top_buyers_missing <- df %>%
      dplyr::select(!dplyr::starts_with("ind_")) %>%
      dplyr::group_by(buyer_masterid, buyer_buyertype) %>%
      dplyr::filter(dplyr::n() >= config$thresholds$min_buyer_contracts) %>%
      dplyr::summarise(
        missing_share = mean(is.na(dplyr::cur_data_all()), na.rm = TRUE),
        .groups = "drop"
      ) %>%
      dplyr::arrange(dplyr::desc(missing_share)) %>%
      dplyr::slice_head(n = config$thresholds$top_n_buyers) %>%
      dplyr::mutate(
        # ADDED: Apply buyer group transformation
        buyer_group = add_buyer_group(buyer_buyertype),
        buyer_group_label = as.character(buyer_group),
        # UPDATED: Use buyer_group_label instead of buyer_buyertype
        buyer_label = paste0(
          buyer_masterid, "\n",
          buyer_group_label  # CHANGED from buyer_buyertype
        ),
        buyer_label_short = dplyr::if_else(
          nchar(buyer_masterid) > 20,
          paste0(substr(buyer_masterid, 1, 20), "…\n", buyer_group_label),  # CHANGED
          buyer_label
        )
      )
    
    results$top_buyers_plot <- plot_top_bar(
      df = results$top_buyers_missing,
      x_var = "buyer_label_short",
      y_var = "missing_share",
      label_var = "missing_share",
      title = paste("Top buyers with highest overall missing share –", config$country),
      x_lab = "Buyer",
      y_lab = "Missing share",
      fill_color = "skyblue1",
      y_limit = c(0, 1.05),
      percent = TRUE
    )
  }
  
  # Combined buyers plot - USE FIXED LARGE HEIGHT
  if (!is.null(results$by_buyer_plot) && !is.null(results$top_buyers_plot)) {
    results$combined_buyers_plot <- results$by_buyer_plot + results$top_buyers_plot +
      patchwork::plot_layout(nrow = 2)
    
    save_plot(
      results$combined_buyers_plot,
      "buyers_missing",
      width = 12,
      height = 16  # FIXED LARGE HEIGHT (increased from 12)
    )
  }
  
  # By procedure type - USE FIXED HEIGHT
  if (validate_required_columns(df, "tender_proceduretype", "missing by procedure type")) {
    results$by_procedure <- df %>%
      dplyr::mutate(
        proc_group = ifelse(
          is.na(tender_proceduretype),
          "Missing procedure type",
          as.character(tender_proceduretype)
        )
      ) %>%
      dplyr::group_by(proc_group) %>%
      summarise_missing_shares(cols = -dplyr::starts_with("ind_")) %>%
      pivot_missing_long(id_vars = "proc_group") %>%
      dplyr::mutate(
        proc_group_label = ifelse(
          proc_group %in% names(procedure_type_labels),
          procedure_type_labels[proc_group],
          proc_group
        ),
        variable_label = label_with_lookup(variable, label_lookup),
        proc_group_label = factor(
          proc_group_label,
          levels = c(
            "Open (competitive bidding)",
            "Restricted (limited competition)",
            "Negotiated with publication (limited competition)",
            "Negotiated without publication (no competition)",
            "Negotiated (limited competition)",
            "Competitive dialogue (limited competition)",
            "Outright award (direct purchase)",
            "Other (special or exceptional procedures)",
            "Missing procedure type"
          )
        )
      )
    
    results$by_procedure_plot <- plot_missing_heatmap(
      data = results$by_procedure,
      x_var = "proc_group_label",
      y_var = "variable_label",
      fill_var = "missing_share",
      title = paste("Missing value share per variable by procedure type –", config$country),
      x_lab = "Procedure Type",
      y_lab = "Variable",
      height_per_row = 0.35,
      text_size = 3
    )
    
    save_plot(results$by_procedure_plot, "missing_by_proc", width = 12, height = 16)  # FIXED LARGE HEIGHT
  }
  
  # Missing correlations
  results$correlations <- compute_missing_correlations(df)
  
  miss_corr_long <- results$correlations %>%
    corrr::stretch(na.rm = TRUE)
  
  results$correlation_plot <- ggplot2::ggplot(
    miss_corr_long,
    ggplot2::aes(x = x, y = y, fill = r)
  ) +
    ggplot2::geom_tile() +
    ggplot2::scale_fill_gradient2(
      low = "blue",
      mid = "white",
      high = "red",
      midpoint = 0,
      limits = c(-1, 1)
    ) +
    ggplot2::coord_equal() +
    ggplot2::labs(
      title = paste("Correlation of NAs across variables –", config$country),
      x = "Variable",
      y = "Variable",
      fill = "Correlation"
    ) +
    standard_plot_theme() +
    ggplot2::theme(
      axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust = 1)
    )
  
  save_plot(results$correlation_plot, "missing_corr", width = 10, height = 8)  # FIXED
  
  results$correlation_top <- miss_corr_long %>%
    dplyr::filter(x != y) %>%
    dplyr::arrange(dplyr::desc(abs(r))) %>%
    dplyr::slice_head(n = 20)
  
  # Missing by year
  if (validate_required_columns(df, "tender_year", "missing by year")) {
    results$by_year <- df %>%
      dplyr::filter(!is.na(tender_year)) %>%
      dplyr::group_by(tender_year) %>%
      summarise_missing_shares(cols = !dplyr::starts_with("ind_")) %>%
      pivot_missing_long(id_vars = "tender_year") %>%
      dplyr::mutate(variable_label = label_with_lookup(variable, label_lookup))
    
    top_vars <- results$by_year %>%
      dplyr::group_by(variable_label) %>%
      dplyr::summarise(avg_missing = mean(missing_share, na.rm = TRUE), .groups = "drop") %>%
      dplyr::slice_max(avg_missing, n = config$thresholds$top_n_vars) %>%
      dplyr::pull(variable_label)
    
    results$by_year_plot <- results$by_year %>%
      dplyr::filter(variable_label %in% top_vars) %>%
      ggplot2::ggplot(ggplot2::aes(x = tender_year, y = missing_share, color = variable_label)) +
      ggplot2::geom_line(size = PLOT_SIZES$line_size) +
      ggplot2::geom_point(size = PLOT_SIZES$point_size) +
      ggplot2::scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
      ggplot2::labs(
        title = paste("Trends in missing shares over time (top variables) –", config$country),
        x = "Year",
        y = "Missing share",
        color = "Variable"
      ) +
      standard_plot_theme() +
      ggplot2::theme(
        legend.position = "bottom",
        legend.key.height = ggplot2::unit(0.5, "cm")
      ) +
      ggplot2::guides(color = ggplot2::guide_legend(ncol = 2))
    
    save_plot(results$by_year_plot, "year_miss", width = 10, height = 6)  # FIXED
  }
  
  results
}
# ========================================================================
# PART 12: MODULE - INTEROPERABILITY ANALYSIS
# ========================================================================

#' Analyze interoperability (organization-level matching)
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_interoperability <- function(df, config, output_dir) {
  results <- list()
  
  results$org_missing <- compute_org_missing(df)
  
  results
}

# ========================================================================
# PART 13: MODULE - COMPETITION ANALYSIS
# ========================================================================

#' Build CPV cluster data frame
#' 
#' @param df Data frame
#' @param cpv_digits Number of CPV digits
#' @return Data frame with CPV clusters
build_cpv_df <- function(df, cpv_digits = 3) {
  df %>%
    dplyr::select(
      tender_id,
      lot_number,
      bidder_id,
      lot_productcode,
      bid_priceusd
    ) %>%
    dplyr::mutate(
      cpv_cluster = stringr::str_sub(lot_productcode, 1, cpv_digits)
    )
}

#' Analyze buyer-supplier concentration
#' 
#' @param df Data frame
#' @param config Configuration list
#' @return List of concentration data and plots
analyze_buyer_supplier_concentration <- function(df, config) {
  results <- list()
  
  if (!validate_required_columns(
    df,
    c("buyer_masterid", "bidder_masterid", "bid_priceusd"),
    "buyer-supplier concentration"
  )) {
    return(results)
  }
  
  results$data <- df %>%
    dplyr::group_by(tender_year, buyer_masterid, bidder_masterid) %>%
    dplyr::summarise(
      n_contracts = dplyr::n(),
      total_spend = sum(bid_priceusd, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::group_by(tender_year, buyer_masterid) %>%
    dplyr::mutate(
      buyer_total_spend = sum(total_spend, na.rm = TRUE),
      buyer_total_contract = dplyr::n(),
      buyer_concentration = ifelse(
        buyer_total_spend > 0,
        total_spend / buyer_total_spend,
        NA_real_
      )
    ) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(tender_year, buyer_masterid) %>%
    dplyr::mutate(
      n_suppliers = dplyr::n(),
      buyer_concentration_display = ifelse(
        n_suppliers < config$thresholds$min_suppliers_for_buyer_conc,
        NA_real_,
        buyer_concentration
      )
    ) %>%
    dplyr::ungroup() %>%
    tidyr::drop_na()
  
  # Overall concentration
  top_buyers <- results$data %>%
    dplyr::group_by(buyer_masterid) %>%
    dplyr::summarise(
      max_conc = max(buyer_concentration_display, na.rm = TRUE),
      total_contracts = sum(n_contracts, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::slice_max(max_conc, n = config$thresholds$top_n_buyers)
  
  results$overall_plot <- plot_top_bar(
    df = top_buyers,
    x_var = "buyer_masterid",
    y_var = "max_conc",
    label_var = "total_contracts",
    title = paste("Top Buyers by Maximum Supplier Concentration –", config$country),
    x_lab = "Buyer ID",
    y_lab = "Max share of spending to one supplier",
    fill_color = "skyblue1",
    y_limit = c(0, 1.05),
    percent = FALSE
  )
  
  # Yearly concentration
  top_buyers_yearly <- results$data %>%
    dplyr::filter(tender_year > 2014) %>%
    dplyr::group_by(tender_year, buyer_masterid) %>%
    dplyr::summarise(
      max_conc = max(buyer_concentration_display, na.rm = TRUE),
      total_contracts = sum(n_contracts, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::group_by(tender_year) %>%
    dplyr::slice_max(max_conc, n = 30) %>%
    dplyr::ungroup()
  
  repeated_buyers <- top_buyers_yearly %>%
    dplyr::count(buyer_masterid) %>%
    dplyr::filter(n > 1) %>%
    dplyr::transmute(buyer_masterid, repeated = TRUE)
  
  top_buyers_yearly <- top_buyers_yearly %>%
    dplyr::left_join(repeated_buyers, by = "buyer_masterid") %>%
    dplyr::mutate(repeated = dplyr::if_else(is.na(repeated), FALSE, repeated))
  
  results$yearly_plot <- ggplot2::ggplot(
    top_buyers_yearly,
    ggplot2::aes(
      x = tidytext::reorder_within(buyer_masterid, max_conc, tender_year),
      y = max_conc,
      fill = repeated
    )
  ) +
    ggplot2::geom_col() +
    ggplot2::geom_text(
      ggplot2::aes(label = total_contracts, color = repeated),
      hjust = -0.1,
      size = PLOT_SIZES$geom_text_size - 0.5,
      show.legend = FALSE
    ) +
    ggplot2::coord_flip() +
    ggplot2::facet_wrap(~ tender_year, scales = "free_y") +
    tidytext::scale_x_reordered(labels = function(x) substr(x, 1, 8)) +
    ggplot2::scale_fill_manual(
      values = c(`FALSE` = "skyblue1", `TRUE` = "red"),
      name = "Buyer appears\nin multiple years",
      labels = c("No", "Yes")
    ) +
    ggplot2::scale_color_manual(values = c(`FALSE` = "black", `TRUE` = "red4")) +
    ggplot2::labs(
      title = paste("Top 30 Buyers by Maximum Spending Concentration per Year –", config$country),
      subtitle = "Red bars = buyers that appear in the top list in multiple years",
      x = "Buyer ID",
      y = "Max share of spending to one supplier"
    ) +
    ggplot2::theme_bw(base_size = PLOT_SIZES$base_size) +
    ggplot2::theme(
      plot.title = ggplot2::element_text(size = PLOT_SIZES$title_size, face = "bold"),
      plot.subtitle = ggplot2::element_text(size = PLOT_SIZES$subtitle_size),
      axis.text = ggplot2::element_text(size = PLOT_SIZES$axis_text_size - 2),
      axis.title = ggplot2::element_text(size = PLOT_SIZES$axis_title_size),
      strip.text = ggplot2::element_text(size = PLOT_SIZES$subtitle_size),
      legend.text = ggplot2::element_text(size = PLOT_SIZES$legend_text_size),
      legend.title = ggplot2::element_text(size = PLOT_SIZES$legend_title_size)
    ) +
    ggplot2::ylim(0, 1.05)
  
  results
}

#' Run single-bidding specifications
#' 
#' @param buyer_analysis_fe Buyer analysis data with FE-eligible buyers
#' @param config Configuration list
#' @return Data frame of specification results
run_singleb_specs <- function(buyer_analysis_fe, config) {
  out <- list()
  k <- 0L
  
  for (fe in config$models$fe_set) {
    fe_part <- make_fe_part(fe)
    
    for (cl in config$models$cluster_set) {
      cl_fml <- make_cluster(cl)
      
      for (ctrl in config$models$controls_set) {
        rhs_terms <- switch(
          ctrl,
          "x_only" = c("cumulative_missing_share"),
          "base" = c("cumulative_missing_share", "log1p(n_contracts)", "log1p(avg_contract_value)"),
          "base_extra" = c("cumulative_missing_share", "log1p(n_contracts)", "log1p(avg_contract_value)",
                           "log1p(total_contract_value)", "buyer_buyertype")
        )
        
        rhs_terms <- rhs_terms[rhs_terms %in% names(buyer_analysis_fe)]
        rhs <- paste(rhs_terms, collapse = " + ")
        fml <- stats::as.formula(paste0("cumulative_singleb_rate ~ ", rhs, " | ", fe_part))
        
        m <- safe_fixest(
          fixest::feglm(
            fml,
            family = quasibinomial(link = "logit"),
            data = buyer_analysis_fe,
            cluster = cl_fml
          )
        )
        
        if (is.null(m)) next
        
        eff <- extract_effect_fixest(
          model = m,
          x_name = "cumulative_missing_share",
          data_used = buyer_analysis_fe
        )
        
        eff_strength <- safe_fixest(effect_p10_p90(m, buyer_analysis_fe, "cumulative_missing_share"))
        if (is.null(eff_strength)) eff_strength <- NA_real_
        
        k <- k + 1L
        out[[k]] <- data.frame(
          outcome = "singleb",
          model_type = "fractional_logit",
          fe = fe,
          cluster = cl,
          controls = ctrl,
          estimate = eff$estimate,
          pvalue = eff$pvalue,
          nobs = eff$nobs,
          std_slope = eff$std_slope,
          effect_strength = eff_strength,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  
  if (length(out) == 0) return(data.frame())
  do.call(rbind, out)
}

#' Analyze single bidding
#' 
#' @param df Data frame
#' @param config Configuration list
#' @return List of results
analyze_singleb <- function(df, config) {
  results <- list()
  
  if (!validate_required_columns(df, "ind_corr_singleb", "single-bidding analysis")) {
    return(results)
  }
  
  # Scale to 0-1
  df <- df %>%
    dplyr::mutate(
      ind_corr_singleb = dplyr::if_else(
        !is.na(ind_corr_singleb),
        ind_corr_singleb / 100,
        NA_real_
      )
    )
  
  # Filter by year
  yr <- config$years_singleb
  df_filtered <- df %>%
    dplyr::filter(
      !is.na(tender_year),
      tender_year >= yr$min_year,
      tender_year <= yr$max_year
    )
  
  if (nrow(df_filtered) == 0L) {
    message("No observations after year filter for single-bidding analysis")
    return(results)
  }
  
  # Compute buyer-level aggregates
  buyer_missing_share <- df_filtered %>%
    dplyr::group_by(buyer_masterid, tender_year) %>%
    summarise_missing_shares(cols = !dplyr::starts_with("ind_")) %>%
    dplyr::mutate(
      cumulative_missing_share = rowMeans(
        dplyr::across(dplyr::ends_with("_missing_share")),
        na.rm = TRUE
      )
    )
  
  buyer_singleb_rate <- df_filtered %>%
    dplyr::group_by(buyer_masterid, tender_year) %>%
    dplyr::summarise(
      cumulative_singleb_rate = mean(ind_corr_singleb, na.rm = TRUE),
      .groups = "drop"
    )
  
  buyer_controls <- df_filtered %>%
    dplyr::group_by(buyer_masterid, tender_year, buyer_buyertype) %>%
    dplyr::summarise(
      n_contracts = dplyr::n(),
      avg_contract_value = mean(bid_priceusd, na.rm = TRUE),
      total_contract_value = sum(bid_priceusd, na.rm = TRUE),
      .groups = "drop"
    )
  
  buyer_analysis <- buyer_missing_share %>%
    dplyr::select(buyer_masterid, tender_year, cumulative_missing_share) %>%
    dplyr::inner_join(buyer_singleb_rate, by = c("buyer_masterid", "tender_year")) %>%
    dplyr::inner_join(buyer_controls, by = c("buyer_masterid", "tender_year")) %>%
    dplyr::mutate(
      buyer_buyertype = forcats::fct_explicit_na(as.factor(buyer_buyertype), "Unknown")
    )
  
  # Filter to buyers with sufficient history
  buyer_year_span <- buyer_analysis %>%
    dplyr::group_by(buyer_masterid) %>%
    dplyr::summarise(n_years = dplyr::n_distinct(tender_year), .groups = "drop")
  
  eligible_buyers <- buyer_year_span %>%
    dplyr::filter(n_years >= config$thresholds$min_buyer_years) %>%
    dplyr::pull(buyer_masterid)
  
  buyer_analysis_fe <- buyer_analysis %>%
    dplyr::filter(buyer_masterid %in% eligible_buyers)
  
  if (nrow(buyer_analysis_fe) == 0L) {
    message("No buyers with sufficient years of data for single-bidding analysis")
    return(results)
  }
  
  results$data <- buyer_analysis_fe
  results$specs <- run_singleb_specs(buyer_analysis_fe, config)
  results$singleb_sensitivity <- build_sensitivity_bundle(results$specs)
  
  results
}

#' Analyze competition
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_competition <- function(df, config, output_dir) {
  results <- list()
  save_plot <- create_plot_saver(output_dir, config)
  
  # Buyer-supplier concentration
  conc <- analyze_buyer_supplier_concentration(df, config)
  if (length(conc) > 0) {
    results$concentration <- conc$data
    results$concentration_overall_plot <- conc$overall_plot
    results$concentration_yearly_plot <- conc$yearly_plot
    
    save_plot(conc$overall_plot, "bc_overall")
    save_plot(conc$yearly_plot, "bc_yearly")
  }
  
  # Single bidding
  singleb <- analyze_singleb(df, config)
  if (length(singleb) > 0) {
    results$singleb_data <- singleb$data
    results$singleb_specs <- singleb$specs
    results$singleb_sensitivity <- singleb$singleb_sensitivity  # FIXED!
    
    # Print sensitivity
    if (!is.null(singleb$singleb_sensitivity)) {
      message("\n--- SINGLEB sensitivity (", config$country, ") ---")
      print(singleb$singleb_sensitivity$overall)
      print(singleb$singleb_sensitivity$sign)
      print(singleb$singleb_sensitivity$by_fe)
      print(singleb$singleb_sensitivity$by_cluster)
      print(singleb$singleb_sensitivity$by_controls)
      print(singleb$singleb_sensitivity$classes)
      print(singleb$singleb_sensitivity$top_cells)
    }
    
    # Select and plot best model
    if (!is.null(singleb$specs) && nrow(singleb$specs) > 0) {
      best_row <- pick_best_model(
        singleb$specs,
        require_positive = TRUE,
        p_max = config$models$p_max,
        strength_col = "effect_strength"
      )
      
      if (!is.null(best_row)) {
        # Refit model
        fe_part <- make_fe_part(best_row$fe)
        cl_fml <- make_cluster(best_row$cluster)
        
        rhs_terms <- switch(
          best_row$controls,
          "x_only" = c("cumulative_missing_share"),
          "base" = c("cumulative_missing_share", "log1p(n_contracts)", "log1p(avg_contract_value)"),
          "base_extra" = c("cumulative_missing_share", "log1p(n_contracts)", "log1p(avg_contract_value)",
                           "log1p(total_contract_value)", "buyer_buyertype")
        )
        rhs <- paste(rhs_terms, collapse = " + ")
        fml <- stats::as.formula(paste0("cumulative_singleb_rate ~ ", rhs, " | ", fe_part))
        
        model <- fixest::feglm(
          fml,
          family = quasibinomial(link = "logit"),
          data = singleb$data,
          cluster = cl_fml
        )
        
        pred <- tryCatch(
          ggeffects::ggpredict(model, terms = "cumulative_missing_share"),
          error = function(e) NULL
        )
        
        if (!is.null(pred)) {
          years_used <- range(singleb$data$tender_year, na.rm = TRUE)
          
          results$singleb_plot <- plot_ggeffects_line(
            pred = pred,
            title = paste("Predicted Single-Bidding by Missing Share –", config$country),
            subtitle = paste0(
              "(BEST) Model: ", pretty_model_name(best_row$model_type),
              " | Years: ", years_used[1], "–", years_used[2],
              " | N=", best_row$nobs,
              " | FE: ", pretty_fe_label(best_row$fe),
              " | Cluster: ", best_row$cluster,
              " | ", fe_counts_note(singleb$data, best_row$fe)
            ),
            x_lab = "Buyer Missing Share",
            y_lab = "Predicted Single-Bidding Share",
            caption = paste0(
              "Filters: tender_year in [", config$years_singleb$min_year, ", ",
              config$years_singleb$max_year,
              "]; buyers with ≥", config$thresholds$min_buyer_years, " years of data. ",
              controls_note(best_row$controls)
            )
          )
          
          save_plot(results$singleb_plot, "reg_singleb_missing_share")
        }
      }
    }
  }
  
  results
}

# ========================================================================
# PART 14: MODULE - MARKET ANALYSIS
# ========================================================================

#' Detect unusual market entries
#' 
#' @param cpv_data CPV cluster data
#' @param config Configuration list
#' @return Data frame of unusual entries
detect_unusual_entries <- function(cpv_data, config) {
  # Bidder-CPV statistics
  bidder_cpv_stats <- cpv_data %>%
    dplyr::group_by(bidder_id, cpv_cluster) %>%
    dplyr::summarise(
      n_awards = dplyr::n(),
      total_value = sum(bid_priceusd, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    dplyr::group_by(bidder_id) %>%
    dplyr::mutate(
      bidder_total_awards = sum(n_awards),
      bidder_total_value = sum(total_value),
      share_awards = n_awards / bidder_total_awards,
      share_value = ifelse(
        bidder_total_value > 0,
        total_value / bidder_total_value,
        NA_real_
      )
    ) %>%
    dplyr::ungroup()
  
  # Flag atypical clusters
  df_with_flags <- cpv_data %>%
    dplyr::left_join(
      bidder_cpv_stats %>%
        dplyr::select(bidder_id, cpv_cluster, n_awards, bidder_total_awards, share_awards, share_value),
      by = c("bidder_id", "cpv_cluster")
    ) %>%
    dplyr::mutate(
      enough_history = bidder_total_awards >= config$thresholds$min_history_threshold,
      cpv_cluster_atypical = enough_history &
        share_awards < config$thresholds$marginal_share_threshold &
        n_awards <= config$thresholds$max_wins_atypical
    )
  
  # Cluster-supplier counts
  cluster_supplier_counts <- cpv_data %>%
    dplyr::group_by(cpv_cluster, bidder_id) %>%
    dplyr::summarise(n_ic = dplyr::n(), .groups = "drop")
  
  cluster_totals <- cpv_data %>%
    dplyr::group_by(cpv_cluster) %>%
    dplyr::summarise(
      n_c = dplyr::n(),
      n_suppliers_c = dplyr::n_distinct(bidder_id),
      .groups = "drop"
    )
  
  # Surprise scores
  cluster_surprise <- cluster_supplier_counts %>%
    dplyr::left_join(cluster_totals, by = "cpv_cluster") %>%
    dplyr::mutate(
      p_i_given_c = (n_ic + 1) / (n_c + n_suppliers_c),
      surprise_score = -log(p_i_given_c)
    )
  
  df_with_surprise <- df_with_flags %>%
    dplyr::left_join(
      cluster_surprise %>% dplyr::select(cpv_cluster, bidder_id, surprise_score),
      by = c("cpv_cluster", "bidder_id")
    ) %>%
    dplyr::mutate(
      surprise_z = ifelse(
        !is.na(surprise_score),
        (surprise_score - mean(surprise_score, na.rm = TRUE)) / sd(surprise_score, na.rm = TRUE),
        NA_real_
      )
    )
  
  # Home market
  home_market <- bidder_cpv_stats %>%
    dplyr::group_by(bidder_id) %>%
    dplyr::slice_max(n_awards, n = 1, with_ties = FALSE) %>%
    dplyr::ungroup() %>%
    dplyr::transmute(bidder_id, home_cpv_cluster = cpv_cluster)
  
  # Unusual entries
  df_unusual <- df_with_surprise %>%
    dplyr::left_join(home_market, by = "bidder_id") %>%
    dplyr::mutate(target_cpv_cluster = cpv_cluster) %>%
    dplyr::filter(enough_history, cpv_cluster_atypical, !is.na(surprise_z))
  
  df_unusual
}

#' Build unusual entry matrix
#' 
#' @param df_unusual Unusual entries data
#' @param config Configuration list
#' @return Data frame of unusual entry flows
build_unusual_matrix <- function(df_unusual, config) {
  df_unusual %>%
    dplyr::group_by(home_cpv_cluster, target_cpv_cluster) %>%
    dplyr::summarise(
      n_bidders = dplyr::n_distinct(bidder_id),
      n_awards = dplyr::n(),
      mean_surprise = mean(surprise_z, na.rm = TRUE),
      .groups = "drop"
    )
}

#' Analyze markets
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_markets <- function(df, config, output_dir) {
  results <- list()
  save_plot <- create_plot_saver(output_dir, config)
  
  required_cols <- c("tender_id", "lot_number", "bidder_id", "lot_productcode", "bid_priceusd")
  
  if (!validate_required_columns(df, required_cols, "market analysis")) {
    return(results)
  }
  
  # Build CPV data
  results$cpv_data <- build_cpv_df(df, config$thresholds$cpv_digits)
  
  # Detect unusual entries
  df_unusual <- detect_unusual_entries(results$cpv_data, config)
  
  if (nrow(df_unusual) == 0) {
    message("No unusual market entries detected")
    return(results)
  }
  
  results$unusual_entries <- df_unusual
  results$unusual_matrix <- build_unusual_matrix(df_unusual, config)
  
  # Network plot
  edges_markets <- results$unusual_matrix %>%
    dplyr::filter(n_bidders >= config$thresholds$min_bidders_for_edge) %>%
    dplyr::rename(from = home_cpv_cluster, to = target_cpv_cluster)
  
  if (nrow(edges_markets) > 0) {
    g_markets <- igraph::graph_from_data_frame(edges_markets, directed = TRUE)
    
    # In analyze_markets function, update network plot:
    results$network_plot <- ggraph::ggraph(g_markets, layout = "fr") +
      ggraph::geom_edge_link(
        ggplot2::aes(width = n_bidders, colour = mean_surprise),
        arrow = grid::arrow(length = grid::unit(0.15, "cm")),
        end_cap = ggraph::circle(2, "mm"),
        alpha = 0.6
      ) +
      ggraph::geom_node_point(size = PLOT_SIZES$point_size + 1, colour = "darkorange") +
      ggraph::geom_node_text(ggplot2::aes(label = name), repel = TRUE, size = PLOT_SIZES$geom_text_size) +
      ggraph::scale_edge_width(range = c(0.2, 2)) +
      ggraph::scale_edge_colour_continuous(
        low = "green",
        high = "red",
        name = "Mean surprise score (standardised)"
      ) +
      ggplot2::labs(
        title = paste("Network of unusual market entries –", config$country),
        subtitle = "Nodes represent CPV clusters; edges show atypical flow of bidders"
      ) +
      ggplot2::theme_void(base_size = PLOT_SIZES$base_size) +
      ggplot2::theme(
        plot.title = ggplot2::element_text(size = PLOT_SIZES$title_size, face = "bold"),
        plot.subtitle = ggplot2::element_text(size = PLOT_SIZES$subtitle_size),
        legend.text = ggplot2::element_text(size = PLOT_SIZES$legend_text_size),
        legend.title = ggplot2::element_text(size = PLOT_SIZES$legend_title_size)
      )
  }
  
  # Supplier-level unusual behavior
  results$supplier_unusual <- df_unusual %>%
    dplyr::group_by(bidder_id, home_cpv_cluster) %>%
    dplyr::summarise(
      max_surprise_z = max(surprise_z, na.rm = TRUE),
      mean_surprise_z = mean(surprise_z, na.rm = TRUE),
      n_atypical_markets = dplyr::n_distinct(target_cpv_cluster),
      n_atypical_awards = dplyr::n(),
      .groups = "drop"
    )
  
  top_suppliers <- results$supplier_unusual %>%
    dplyr::filter(n_atypical_awards >= 3) %>%
    dplyr::arrange(dplyr::desc(max_surprise_z)) %>%
    dplyr::slice_head(n = config$thresholds$top_n_suppliers)
  
  results$supplier_unusual_plot <- ggplot2::ggplot(
    top_suppliers,
    ggplot2::aes(
      x = reorder(bidder_id, max_surprise_z),
      y = max_surprise_z,
      fill = n_atypical_markets
    )
  ) +
    ggplot2::geom_col() +
    ggplot2::coord_flip() +
    ggplot2::scale_fill_viridis_c(option = "C", name = "No. atypical\nmarkets") +
    ggplot2::labs(
      title = paste("Top suppliers by unusual market behaviour –", config$country),
      subtitle = "Max standardised surprise score across atypical CPV entries",
      x = "Supplier ID",
      y = "Max surprise z-score"
    ) +
    standard_plot_theme()
  
  save_plot(results$supplier_unusual_plot, "supp_unusual_suppliers")
  
  # Market-level unusual entries
  results$market_unusual <- df_unusual %>%
    dplyr::group_by(target_cpv_cluster) %>%
    dplyr::summarise(
      mean_surprise_z = mean(surprise_z, na.rm = TRUE),
      max_surprise_z = max(surprise_z, na.rm = TRUE),
      n_unusual_suppliers = dplyr::n_distinct(bidder_id),
      n_unusual_awards = dplyr::n(),
      .groups = "drop"
    ) %>%
    dplyr::mutate(market_risk_index = mean_surprise_z * log1p(n_unusual_suppliers))
  
  top_markets <- results$market_unusual %>%
    dplyr::slice_max(market_risk_index, n = config$thresholds$top_n_markets)
  
  results$market_unusual_plot <- ggplot2::ggplot(
    top_markets,
    ggplot2::aes(
      x = n_unusual_suppliers,
      y = mean_surprise_z,
      size = n_unusual_awards,
      colour = mean_surprise_z,
      label = target_cpv_cluster
    )
  ) +
    ggplot2::geom_point(alpha = 0.85) +
    ggrepel::geom_text_repel(size = PLOT_SIZES$geom_text_size, colour = "gray20") +
    ggplot2::scale_size_continuous(name = "No. atypical awards", range = c(3, 12)) +
    ggplot2::scale_colour_gradientn(
      colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"),
      name = "Mean surprise (z-score)"
    ) +
    ggplot2::labs(
      title = paste("Markets attracting unusual supplier entries –", config$country),
      subtitle = "Each point is a CPV cluster; size = awards; color = surprise",
      x = "Number of suppliers entering atypically",
      y = "Mean surprise z-score"
    ) +
    standard_plot_theme()
  
  save_plot(results$market_unusual_plot, "supp_unusual_markets")
  
  results
}

# ========================================================================
# PART 15: MODULE - PRICE ANALYSIS
# ========================================================================

#' Prepare price data
#' 
#' @param df Data frame
#' @param config Configuration list
#' @return Prepared data frame
prepare_price_data <- function(df, config) {
  if (!validate_required_columns(df, c("bid_price", "lot_estimatedprice"), "price analysis")) {
    return(NULL)
  }
  
  # Determine value variable
  val_var <- dplyr::case_when(
    "bid_priceusd" %in% names(df) ~ "bid_priceusd",
    "bid_price" %in% names(df) ~ "bid_price",
    TRUE ~ NA_character_
  )
  
  # Compute total missing share
  key_vars <- names(label_lookup) %>% gsub("_missing_share", "", .)
  df <- df %>%
    dplyr::mutate(
      total_missing_share = rowMeans(
        dplyr::across(all_of(key_vars), ~ is.na(.)),
        na.rm = TRUE
      )
    )
  
  # Add log contract value
  if (!is.na(val_var)) {
    df <- df %>%
      dplyr::mutate(log_contract_value = log1p(.data[[val_var]]))
  }
  
  # Filter by year
  yrp <- config$years_relprice
  rel_price_data <- df %>%
    dplyr::mutate(
      relative_price = bid_price / lot_estimatedprice,
      relative_price = ifelse(
        relative_price > config$thresholds$max_relative_price |
          relative_price <= config$thresholds$min_relative_price,
        NA,
        relative_price
      )
    ) %>%
    dplyr::filter(
      !is.na(tender_year),
      tender_year >= yrp$min_year,
      tender_year <= yrp$max_year
    )
  
  # Relevel factors
  if ("tender_proceduretype" %in% names(rel_price_data)) {
    rel_price_data <- rel_price_data %>%
      dplyr::mutate(
        tender_proceduretype = stats::relevel(as.factor(tender_proceduretype), ref = "OPEN")
      )
  }
  if ("buyer_buyertype" %in% names(rel_price_data)) {
    rel_price_data <- rel_price_data %>%
      dplyr::mutate(
        buyer_buyertype = stats::relevel(as.factor(buyer_buyertype), ref = "UTILITIES")
      )
  }
  
  rel_price_data
}


#' Run relative price specifications
#' 
#' @param rel_price_data Price data
#' @param config Configuration list
#' @return Data frame of specification results
run_relprice_specs <- function(rel_price_data, config) {
  out <- list()
  k <- 0L
  
  fe_set <- c("0", config$models$fe_set)
  
  for (mt in config$models$model_types_relprice) {
    for (fe in fe_set) {
      fe_part <- make_fe_part(fe)
      
      for (cl in config$models$cluster_set) {
        if (fe == "0" && cl == "none") next
        
        cl_fml <- make_cluster(cl)
        
        for (ctrl in config$models$controls_set) {
          rhs_terms <- switch(
            ctrl,
            "x_only" = c("total_missing_share"),
            "base" = c("total_missing_share", "log_contract_value",
                       "buyer_buyertype", "tender_proceduretype")
          )
          
          rhs_terms <- rhs_terms[rhs_terms %in% names(rel_price_data)]
          
          # ADDED: Skip if no valid terms
          if (length(rhs_terms) == 0) {
            message("Skipping spec: no valid RHS terms for controls=", ctrl)
            next
          }
          
          rhs <- paste(rhs_terms, collapse = " + ")
          
          m <- NULL
          d_used <- NULL
          x_name <- "total_missing_share"
          
          if (mt == "ols_level") {
            fml <- stats::as.formula(paste0("relative_price ~ ", rhs, " | ", fe_part))
            m <- safe_fixest(fixest::feols(fml, data = rel_price_data, cluster = cl_fml))
            d_used <- rel_price_data
            
          } else if (mt == "ols_log") {
            d_used <- rel_price_data %>%
              dplyr::mutate(log_relative_price = log(relative_price))
            d_used$log_relative_price[!is.finite(d_used$log_relative_price)] <- NA_real_
            d_used <- d_used[!is.na(d_used$log_relative_price), , drop = FALSE]
            if (nrow(d_used) == 0) next
            
            fml <- stats::as.formula(paste0("log_relative_price ~ ", rhs, " | ", fe_part))
            m <- safe_fixest(fixest::feols(fml, data = d_used, cluster = cl_fml))
            
          } else if (mt == "gamma_log") {
            fml <- stats::as.formula(paste0("relative_price ~ ", rhs, " | ", fe_part))
            m <- safe_fixest(fixest::feglm(
              fml, data = rel_price_data, family = Gamma(link = "log"), cluster = cl_fml
            ))
            d_used <- rel_price_data
          }
          
          if (is.null(m)) next
          
          # ADDED: Check if x_name is actually in the model
          if (!(x_name %in% names(coef(m)))) {
            message("Skipping spec: ", x_name, " not in model coefficients")
            next
          }
          
          eff <- extract_effect_fixest(model = m, x_name = x_name, data_used = d_used)
          
          k <- k + 1L
          out[[k]] <- data.frame(
            outcome = "rel_price",
            model_type = mt,
            fe = fe,
            cluster = cl,
            controls = ctrl,
            estimate = eff$estimate,
            pvalue = eff$pvalue,
            nobs = eff$nobs,
            std_slope = eff$std_slope,
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  
  if (length(out) == 0) return(data.frame())
  do.call(rbind, out)
}

#' Analyze prices
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_prices <- function(df, config, output_dir) {
  results <- list()
  
  # Create plot saver
  save_plot <- create_plot_saver(output_dir, config)
  
  results$data <- prepare_price_data(df, config)
  
  if (is.null(results$data) || nrow(results$data) == 0) {
    message("No data available for price analysis")
    return(results)
  }
  
  results$specs <- run_relprice_specs(results$data, config)
  
  # Check if specs were generated
  if (is.null(results$specs) || nrow(results$specs) == 0) {
    message("No relative price specifications generated for ", config$country)
    return(results)
  }
  
  results$relprice_sensitivity <- build_sensitivity_bundle(results$specs)
  
  # Print sensitivity
  if (!is.null(results$relprice_sensitivity)) {
    message("\n--- REL_PRICE sensitivity (", config$country, ") ---")
    print(results$relprice_sensitivity$overall)
    print(results$relprice_sensitivity$sign)
    print(results$relprice_sensitivity$by_fe)
    print(results$relprice_sensitivity$by_cluster)
    print(results$relprice_sensitivity$by_controls)
    print(results$relprice_sensitivity$classes)
    print(results$relprice_sensitivity$top_cells)
  }
  
  # Select and plot best model
  best_row <- pick_best_model(
    results$specs,
    require_positive = TRUE,
    p_max = config$models$p_max,
    strength_col = "std_slope"
  )
  
  if (is.null(best_row)) {
    message("No relative price model met selection criteria for ", config$country)
    return(results)
  }
  
  message("Relative price BEST spec for ", config$country, ": ",
          "model=", best_row$model_type,
          ", fe=", best_row$fe, 
          ", cluster=", best_row$cluster,
          ", controls=", best_row$controls,
          ", p=", signif(best_row$pvalue, 3))
  
  # Refit model
  fe_part <- make_fe_part(best_row$fe)
  cl_fml <- make_cluster(best_row$cluster)
  
  rhs_terms <- switch(
    best_row$controls,
    "x_only" = c("total_missing_share"),
    "base" = c("total_missing_share", "log_contract_value",
               "buyer_buyertype", "tender_proceduretype")
  )
  rhs_terms <- rhs_terms[rhs_terms %in% names(results$data)]
  rhs <- paste(rhs_terms, collapse = " + ")
  
  model <- NULL
  pred <- NULL
  y_lab <- NULL
  
  if (best_row$model_type == "ols_level") {
    fml <- stats::as.formula(paste0("relative_price ~ ", rhs, " | ", fe_part))
    model <- fixest::feols(fml, data = results$data, cluster = cl_fml)
    pred <- tryCatch(
      ggeffects::ggpredict(model, terms = "total_missing_share"),
      error = function(e) {
        message("Error in ggpredict: ", e$message)
        NULL
      }
    )
    y_lab <- "Predicted Relative Price"
    
  } else if (best_row$model_type == "ols_log") {
    d <- results$data %>% dplyr::mutate(log_relative_price = log(relative_price))
    d$log_relative_price[!is.finite(d$log_relative_price)] <- NA_real_
    d <- d[!is.na(d$log_relative_price), , drop = FALSE]
    
    if (nrow(d) > 0) {
      fml <- stats::as.formula(paste0("log_relative_price ~ ", rhs, " | ", fe_part))
      model <- fixest::feols(fml, data = d, cluster = cl_fml)
      pred <- tryCatch(
        ggeffects::ggpredict(model, terms = "total_missing_share"),
        error = function(e) {
          message("Error in ggpredict: ", e$message)
          NULL
        }
      )
      y_lab <- "Predicted log(Relative Price)"
    }
    
  } else if (best_row$model_type == "gamma_log") {
    fml <- stats::as.formula(paste0("relative_price ~ ", rhs, " | ", fe_part))
    model <- fixest::feglm(
      fml, data = results$data, family = Gamma(link = "log"), cluster = cl_fml
    )
    pred <- tryCatch(
      ggeffects::ggpredict(model, terms = "total_missing_share"),
      error = function(e) {
        message("Error in ggpredict: ", e$message)
        NULL
      }
    )
    y_lab <- "Predicted Relative Price (Gamma-log)"
  }
  
  if (!is.null(pred)) {
    years_used <- range(results$data$tender_year, na.rm = TRUE)
    
    results$rel_price_plot <- plot_ggeffects_line(
      pred = pred,
      title = paste("Predicted Relative Price by Missing Share –", config$country),
      subtitle = paste0(
        "(BEST) Model: ", pretty_model_name(best_row$model_type),
        " | Years: ", years_used[1], "–", years_used[2],
        " | N=", best_row$nobs,
        " | FE: ", pretty_fe_label(best_row$fe),
        " | Cluster: ", best_row$cluster,
        " | Controls: ", pretty_controls_label(best_row$controls)
      ),
      x_lab = "Total Missing Share",
      y_lab = y_lab,
      caption = paste0(
        "Filters: tender_year in [", config$years_relprice$min_year, ", ",
        config$years_relprice$max_year, "]. ",
        controls_note(best_row$controls)
      )
    )
    
    # Save the plot
    tryCatch({
      save_plot(results$rel_price_plot, "reg_relative_price_missing_share")
      message("✓ Relative price plot saved successfully for ", config$country)
    }, error = function(e) {
      message("✗ Error saving relative price plot: ", e$message)
      # Fallback: save directly
      ggplot2::ggsave(
        filename = file.path(output_dir, paste0("reg_relative_price_missing_share_", config$country, ".png")),
        plot = results$rel_price_plot,
        width = 10,
        height = 8,
        dpi = 300
      )
      message("✓ Saved plot using fallback method")
    })
    
  } else {
    message("Could not generate predictions for relative price plot")
  }
  
  results
}

# ========================================================================
# PART 16: MODULE - REGIONAL ANALYSIS
# ========================================================================

#' Analyze regional patterns (NUTS3)
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of results
analyze_regional <- function(df, config, output_dir) {
  results <- list()
  save_plot <- create_plot_saver(output_dir, config)
  
  if (!validate_required_columns(df, "buyer_nuts", "regional analysis")) {
    return(results)
  }
  
  df_nuts <- df %>% clean_nuts3(buyer_nuts)
  
  results$missing_by_region <- df_nuts %>%
    dplyr::filter(!is.na(nuts3)) %>%
    dplyr::select(!dplyr::starts_with("ind_")) %>%
    dplyr::group_by(nuts3) %>%
    dplyr::summarise(
      missing_share = mean(is.na(dplyr::cur_data_all()), na.rm = TRUE),
      .groups = "drop"
    )
  
  # Try to get map data
  region_sf <- tryCatch(
    {
      eurostat::get_eurostat_geospatial(
        output_class = "sf",
        nuts_level = 3,
        year = 2021
      ) %>%
        dplyr::filter(CNTR_CODE == config$country)
    },
    error = function(e) {
      message("Could not fetch NUTS3 map data: ", e$message)
      NULL
    }
  )
  
  if (!is.null(region_sf) && nrow(region_sf) > 0) {
    results$region_map_data <- region_sf %>%
      dplyr::left_join(results$missing_by_region, by = c("NUTS_ID" = "nuts3"))
    
    results$region_missing_map <- ggplot2::ggplot(results$region_map_data) +
      ggplot2::geom_sf(ggplot2::aes(fill = missing_share), color = "grey40", size = 0.2) +
      ggplot2::scale_fill_distiller(
        palette = "Blues",
        direction = 1,
        na.value = "grey90",
        labels = scales::percent_format(accuracy = 1),
        name = "Missing share"
      ) +
      ggplot2::labs(
        title = paste("Overall share of missing values by region (NUTS3) –", config$country)
      ) +
      ggplot2::theme_minimal()
    
    save_plot(results$region_missing_map, "region_missing_map")
  }
  
  results
}

# ========================================================================
# PART 17: SUMMARY STATISTICS MODULE
# ========================================================================

#' Log summary statistics
#' 
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return List of summary stats
log_summary_stats <- function(df, config, output_dir) {
  stats <- list()
  
  # Observations per year
  stats$n_obs_per_year <- df %>%
    dplyr::count(tender_year, name = "n_observations")
  
  # Unique entities
  stats$n_unique_buyers <- if ("buyer_masterid" %in% names(df)) {
    dplyr::n_distinct(df$buyer_masterid, na.rm = TRUE)
  } else {
    NA_integer_
  }
  
  stats$n_unique_bidders <- if ("bidder_masterid" %in% names(df)) {
    dplyr::n_distinct(df$bidder_masterid, na.rm = TRUE)
  } else {
    NA_integer_
  }
  
  # Tenders per year
  stats$tender_year_tenders <- if ("tender_id" %in% names(df)) {
    df %>%
      dplyr::group_by(tender_year) %>%
      dplyr::summarise(
        n_unique_tender_id = dplyr::n_distinct(tender_id),
        .groups = "drop"
      )
  } else {
    NULL
  }
  
  # Variables present
  stats$vars_present <- names(df)[!startsWith(names(df), "ind_")]
  
  # Print summary
  cat("\n", strrep("=", 70), "\n")
  cat("DATA SUMMARY FOR", config$country, "\n")
  cat(strrep("=", 70), "\n\n")
  
  cat("Contracts per year:\n")
  print(stats$n_obs_per_year)
  cat("\n")
  
  if (!is.na(stats$n_unique_buyers)) {
    cat("Number of unique buyers (buyer_masterid):", stats$n_unique_buyers, "\n\n")
  }
  
  if (!is.na(stats$n_unique_bidders)) {
    cat("Number of unique bidders (bidder_masterid):", stats$n_unique_bidders, "\n\n")
  }
  
  if (!is.null(stats$tender_year_tenders)) {
    cat("Number of unique tenders per year:\n")
    print(stats$tender_year_tenders)
    cat("\n")
  }
  
  cat("Variables present (excluding indicators):\n")
  cat(paste(stats$vars_present, collapse = ", "), "\n")
  cat("\n", strrep("=", 70), "\n\n")
  
  invisible(stats)
}

# ========================================================================
# PART 18: SAFE MODULE EXECUTION
# ========================================================================

#' Safely run analysis module
#' 
#' @param module_fn Module function
#' @param df Data frame
#' @param config Configuration list
#' @param output_dir Output directory
#' @return Module results or error info
safely_run_module <- function(module_fn, df, config, output_dir) {
  module_name <- deparse(substitute(module_fn))
  
  tryCatch(
    {
      message("\n", strrep("-", 60))
      message("Running module: ", module_name)
      message(strrep("-", 60))
      
      result <- module_fn(df, config, output_dir)
      
      message("✓ Module completed successfully: ", module_name)
      result
    },
    error = function(e) {
      message("✗ Module failed: ", module_name)
      message("Error: ", e$message)
      list(error = e$message, status = "failed")
    }
  )
}

# ========================================================================
# PART 19: ENSURE OUTPUT DIRECTORY
# ========================================================================

#' Ensure output directory exists
#' 
#' @param output_dir Output directory path
ensure_output_directory <- function(output_dir) {
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
    message("Created output directory: ", output_dir)
  }
}

# ========================================================================
# PART 20: MAIN PIPELINE
# ========================================================================

#' Run complete integrity analysis pipeline
#' 
#' @param df Data frame with procurement data
#' @param country_code Two-letter country code
#' @param output_dir Directory for outputs
#' @return List of analysis results
#' @export
run_integrity_pipeline <- function(df, country_code = "GEN", output_dir) {
  
  # Initialize
  message("\n", strrep("=", 70))
  message("PROCUREMENT INTEGRITY ANALYSIS PIPELINE")
  message("Country: ", country_code)
  message(strrep("=", 70), "\n")
  
  config <- create_pipeline_config(country_code)
  ensure_output_directory(output_dir)
  
  # Prepare data
  df <- prepare_data(df)
  data_quality <- check_data_quality(df, config)
  
  # Log summary statistics
  summary_stats <- log_summary_stats(df, config, output_dir)
  
  # Run analysis modules
  results <- list(
    config = config,
    data_quality = data_quality,
    summary_stats = summary_stats,
    missing = safely_run_module(analyze_missing_values, df, config, output_dir),
    interoperability = safely_run_module(analyze_interoperability, df, config, output_dir),
    competition = safely_run_module(analyze_competition, df, config, output_dir),
    markets = safely_run_module(analyze_markets, df, config, output_dir),
    prices = safely_run_module(analyze_prices, df, config, output_dir),
    regional = safely_run_module(analyze_regional, df, config, output_dir)
  )
  
  # Final message
  message("\n", strrep("=", 70))
  message("PIPELINE COMPLETED")
  message("Results saved to: ", output_dir)
  message(strrep("=", 70), "\n")
  
  invisible(results)
}

# ========================================================================
# END OF SCRIPT
# ========================================================================